1. Executive Summary


This report describes our research on a synthesis approach to parallel software engineering. The
main goal of this project was to develop concepts and generic tools to support the synthesis of
parallel algorithms from formal specifications, and to carry out a representative sample of derivations
for a variety of applications. Our technical approach is based on program transformation technology,
which allows the systematic machine-supported development of software from requirement
specifications. The development process can produce highly efficient parallel code along with a proof of
the code’s correctness.

Our approach to developing parallel software involves several stages. The first step is to develop
a formal model of the problem domain, called a domain theory. Second, the constraints of a
particular parallel problem are stated within a domain theory as a problem specification. Finally, an
algorithm is produced semi-automatically by applying a sequence of design tactics and program
transformations to the problem specification. The tactics and transformations embody
programming knowledge about algorithms, data structures, program optimization techniques, etc. The
result of the transformation process is executable code that is consistent with the problem
specification.

We  focused  on  three  application  domains  to  test  our  concepts: 


•  Batcher’s  Sort  -  We  developed  a  tactic  for  designing  divide-and-conquer  algorithms  in  the 
Specware  system.  We  used  this  tactic  to  design  a  well-known  sorting  algorithm,  called 
Batcher’s  Sort,  which  is  the  most  commonly  used  parallel  sorting  algorithm  in  current  practice. 
Our  implementation  in  Specware  used  concepts  of  unskolemization  and  ladder  construction 
that  arose  from  earlier  experience  with  the  KIDS  system. 

• Scheduling Real-Time Concurrent Processes on Parallel Processors - Flight avionics is an
example of a critical Air Force software application that has at its core the scheduling of
real-time concurrent processes on parallel processors. We exploited our experience with
high-performance scheduling algorithms to synthesize an algorithm for solving this generic problem.
The algorithm itself is sequential (although see the next item), but it generates a schedule
of activities running in a parallel environment so that all hard deadlines and periodicity
constraints are met.

• Limited Discrepancy Search (LDS) - LDS is a generic tree search strategy developed at
the University of Oregon that aims to quickly find near-optimal solutions. It has the advantage
that  it  can  also  be  adapted  to  searching  a  tree  in  parallel.  We  implemented  LDS  in  KIDS 
as  a  global  search  program  scheme.  As  an  application,  we  derived  a  parallel  version  of  the 
scheduler  in  the  previous  item  -  thus  we  have  a  parallel  scheduler  of  parallel  processes! 

2. Introduction


Advances in parallel hardware technology have far outstripped parallel software development
technology. As a result, the construction of parallel programs is still more of an art than a science. The
increase in computing power has to be supplemented by design methods that enable us to
structure problems into a form suitable for parallel computation.

We  argue  here  for  a  refinement  approach  to  developing  parallel  software  from  formal  specifications 
of  software  system  requirements.  A  refinement  approach  allows  automated  support  which  will  in 
turn  be  necessary  for  constructing  large-scale  complex  parallel  software  applications. 

We first describe the refinement approach and then list various difficulties with current approaches
to parallel software engineering and how the refinement approach addresses them. The
following steps describe a refinement approach to developing a system to run on a parallel/distributed
platform.


1.  Domain  modeling  and  acquisition  of  a  requirements  specification 

A user gathers and formulates requirements on the functionality, performance, constraints
on the target language and architecture, and other properties of the desired software. The
resulting specification is a pre-implementation statement of the nature of the software system
and some constraints on how it is to be implemented and on what target platforms. It is
intended to convey as few implementation commitments as possible, giving the maximum
flexibility for choosing software architectures, algorithms, data structures, and target hardware
architectures.

2.  Initial  design  and  refinement 

During the design phase, the user exploits various taxonomies of design knowledge:
taxonomies of software architectures, algorithms [25], datatype theories, etc. A process of
classification incrementally helps the user to discern the intrinsic structure of the specified problem
[23, 22]. Exploiting this structure is the key to a well-designed system. The result of this
stage is a high-level design for the system with maximal parallelism.

3.  Refinement  towards  particular  target  architectures 

The  user  then  begins  to  refine  the  system  design  towards  a  particular  target  architecture 
(or  family  of  architectures).  We  envision  a  taxonomy  of  parallel  architectures  that  can  be 
exploited  to  incrementally  refine  high-level  designs.  The  translation  of  a  high-level  design  to 
an  architecture  may  require  the  aggregation  of  virtual  processes  and  optimization  of  the  result; 
also, further commitment to the architecture’s topology affects the system’s data access and
communication  paths.  Each  architecture  may  also  support  some  optimizations  that  are  not 
supported  by  its  ancestor  in  the  taxonomy,  so  optimizations  are  performed  incrementally 
along  the  way.  Each  path  in  the  taxonomy  results  in  code  for  a  different  architecture.  The 
result  is  a  family  of  implementations,  each  stemming  from  (and  consistent  with)  the  same 
original  specification,  but  each  adapted  to  and  optimized  for  its  own  architecture. 

4.  System  Evolution 

Any  change  in  the  user’s  requirements  is  reflected  in  a  change  to  the  original  specification. 
This  change  can  then  be  propagated  through  the  family  tree  of  implementations,  either  by 
reusing  the  old  derivations,  or,  if  the  change  is  radical  enough,  resynthesizing  the  codes  with 
new design and optimization decisions at many points. Such evolution is inevitable and must
be  supported. 


Machine  support  for  this  refinement  process  is  critical  because  of  the  tremendous  amount  of  logical 
detail  involved  at  each  step.  Fortunately,  good  progress  has  been  made  on  the  formal  foundations  of 
the refinement approach, thereby enabling such mechanization. The many derivations of sequential
programs performed using the KIDS system [24] provide some evidence for the ultimate feasibility
of the refinement approach.

We  now  list  several  aspects  of  parallel  software  engineering  and  how  the  refinement  approach 
addresses  them. 

1.  Reengineering  of  Legacy  Codes 

A considerable amount of effort has gone into tools for reengineering legacy sequential
programs for vector processors. There has been limited success, but in general most old codes use
inherently sequential algorithms, so reengineering can be more trouble than it is worth: it
cannot fully exploit the potential parallelism, because it is not even implicitly present in the
legacy code. Also, it is difficult for a reengineering approach to exploit the progress made in
developing new parallel algorithms for a variety of basic problems in image processing and
scientific computing, for example.

Many contemporary parallel programming languages are extensions to established sequential
languages (e.g. FORTRAN-90, CM-LISP) and are usually designed with a particular
architecture in mind. Many “parallel programs” are adaptations of previously written sequential
programs, especially in the scientific programming world. The focus of parallelization efforts
in numerical computing has been the optimization of inner loops and the identification of
idiomatic patterns of computation which can be replaced by vector operations.

Part  of  the  goal  of  parallel  software  engineering  is  to  develop  a  methodology  and  tool-support 
that  allows  truly  portable  (or  architecture-independent)  program  designs.  Sequential  code 
should  emerge  from  such  designs  as  a  special  case  (i.e.,  executable  code  for  both  parallel  and 
sequential  architectures  should  be  emittable  from  translators/compilers).  We  believe  that 
a  programming  language  (in  which  the  programmer  lays  out  an  executable  design  in  some 
language)  is  almost  always  at  too  low  a  level  to  truly  attain  the  desired  goal. 

Only by pulling back to the requirements level can we have information about the desired
codes that is truly architecture-independent. It is at this abstract level that we can
concentrate on the essence of the problem and expose the parallelism which is inherent. Design
at this level lays out the essential computation that is required to solve the problem. The
design need not concern itself with the constraints imposed by a particular architecture, e.g.,
boundary conditions, communication topology, etc. These issues are the subject of the rest
of the development process, whose task is to realize an abstract computation on a concrete
architecture.

In  short,  we  believe  that  the  reengineering  of  legacy  systems  is  an  inherently  flawed  approach 
because  the  desired  parallelism  is  usually  not  even  implicitly  present  in  the  code.  Truly 
portable  designs  will  only  emerge  by  (1)  starting  with  architecture-independent  requirement 
specifications,  then  (2)  creating  a  high-level  design  with  maximal  parallelism,  then  (3)  refining 
the  design  towards  various  target  architectures. 

2. Difficulty of Writing Correct Parallel Codes

Parallel code tends to be much more complex to understand and make correct than
sequential code. One reason for this phenomenon is that there are only a few control regimes for
sequential computation: if-then-else, while-do, repeat-until, etc. These idioms are well
understood, and the process of putting together these idioms to produce larger programs (program
synthesis and transformation) is beginning to be understood.

For  parallel  computation,  there  is  more  variability  in  control  regimes  because  they  have  to  be 
connected  to  the  topology  of  the  underlying  architecture.  Moreover,  the  proximity  of  data  to 
processors  which  require  it,  and  the  resultant  communication  costs,  are  issues  which  have  to 
be  considered.  Current  models  of  parallel  computation  do  not  adequately  capture  all  of  these 
factors,  thus  making  parallel  programming  difficult. 

Even sequential code is a complex composition of information about the application domain,
the software requirements, software architecture, algorithms, data structures, optimization
techniques and details of the target platform. The extra complexity of a parallel architecture
further compounds the problem. The refinement approach provides a design process that
systematically adds implementation detail to a specification in a way that preserves consistency,
so that the final code is consistent with the initial specification. Furthermore, we have seen
examples of machine-generated codes that begin to press the limits of what human
programmers can comprehend, and we expect more in the future. Also, the refinement steps provide
a formal explanation of the code that is useful for subsequent evolutionary steps.

3.  Many  Abstract  Models  of  Parallel  Computation 

Programming is usually done in terms of a model that abstracts certain details of the
underlying hardware environment. In the parallel domain, there is a diversity of parallel architectures.
The most common theoretical abstract model of parallel computation is the PRAM.
Unfortunately, PRAM programs tend to perform poorly on existing architectures when translated
straightforwardly.

There are abstract models that translate well for various classes of architectures, but using
these models amounts to a premature commitment to the target architecture (and there is
the problem of portability). So there is a tradeoff between simple abstract models of
parallelism and performance on particular architectures. The natural tendency in practice is for
programmers (and theoreticians) to specialize along architectural lines. It may be that there
is no abstract model of parallel computation that can be automatically compiled onto all
parallel architectures with acceptable efficiency.

The refinement approach outlined above finesses this problem by exploiting a taxonomy of
models (architectures) of parallel computation. Models that are deeper in the taxonomy
correspond to a smaller class of target architectures and thus they can be more richly
structured (in terms of processor power and structure, topology and communication costs). The
refinement process can exploit the taxonomy to incrementally add detail about the target
architecture to the design.


In summary, parallel software is much more difficult to construct than sequential software.
Automated support will be necessary for constructing large-scale complex parallel software applications.
We believe that software synthesis will prove to be the most economically viable approach to parallel
software engineering.

In this project, we explored various aspects of the refinement model and carried out several
applications using it. In the following sections we first introduce the KIDS system (Section 3) and
our theoretical model of algorithm design (Section 4). We then describe in detail three parallel
applications that we generated using KIDS or Specware.

In  Section  5  we  describe  a  tactic  for  designing  divide-and-conquer  algorithms  in  the  Specware 
system.  We  used  this  tactic  to  design  a  well-known  sorting  algorithm,  called  Batcher’s  Sort,  which 
is  the  most  commonly  used  parallel  sorting  algorithm  in  current  practice.  Our  implementation 
in  Specware  used  concepts  of  unskolemization  and  ladder  construction  that  arose  from  earlier 
experience  with  the  KIDS  system. 

In Section 6, we describe the synthesis of an algorithm for scheduling real-time concurrent processes
on parallel processors. The algorithm schedules a set of activities running in a parallel environment
so that all hard deadlines and periodicity constraints are met.

In Section 7, we describe the synthesis of tree search algorithms using a novel search strategy called
Limited Discrepancy Search (LDS), which was developed at the University of Oregon.


3.  KIDS  model  of  program  development 


KIDS is a program transformation system - one applies a sequence of consistency-preserving
transformations to an initial specification and achieves a correct and hopefully efficient program [24].
The system emphasizes the application of complex high-level transformations that perform
significant and meaningful actions. From the user’s point of view the system allows the user to make
high-level design decisions like, “design a divide-and-conquer algorithm for that specification” or
“simplify that expression in context”. We hope that decisions at this level will be both intuitive to
the user and be high-level enough that useful programs can be derived within a reasonable number
of steps.

The  user  typically  goes  through  the  following  steps  in  using  KIDS  for  program  development. 

1.  Develop  a  domain  theory  -  An  application  domain  is  modeled  by  a  domain  theory  (a  collection 
of  types,  operations,  laws,  and  inference  rules).  The  domain  theory  specifies  the  concepts, 
operations,  and  relationships  that  characterize  the  application  and  supports  reasoning  about 
the  domain  via  a  deductive  inference  system.  Our  experience  has  been  that  distributive  and 
monotonicity  laws  provide  most  of  the  laws  that  are  needed  to  support  design  and  optimization 
of  code.  KIDS  has  a  theory  development  component  that  supports  the  automated  derivation 
of  various  kinds  of  laws. 

2. Create a specification - The user enters a problem specification stated in terms of the
underlying domain theory.

3. Apply a design tactic - The user selects an algorithm design tactic from a menu and applies
it to a specification. Currently KIDS has tactics for simple problem reduction (reducing a
specification to a library routine), divide-and-conquer, global search (binary search,
backtrack, branch-and-bound), constraint propagation, problem reduction generators (dynamic
programming, general branch-and-bound, and game-tree search algorithms), and local search
(hillclimbing algorithms).

4. Apply optimizations - The KIDS system allows the application of optimization techniques
such as expression simplification, partial evaluation, finite differencing, case analysis, and
other transformations. The user selects an optimization method from a menu and applies it
by pointing at a program expression. Each of the optimization methods is fully automatic
and, with the exception of simplification (which is arbitrarily hard), takes only a few seconds.

5.  Apply  data  type  refinements  -  The  user  can  select  implementations  for  the  high-level  data 
types  in  the  program.  Data  type  refinement  rules  carry  out  the  details  of  constructing  the 
implementation. 

6.  Compile  -  The  resulting  code  is  compiled  to  executable  form.  In  a  sense,  KIDS  can  be 
regarded  as  a  front-end  to  a  conventional  compiler. 


Actually,  the  user  is  free  to  apply  any  subset  of  the  KIDS  operations  in  any  order  -  the  above 
sequence  is  typical  of  our  experiments  in  algorithm  design. 


4.  Algorithm  Design  and  Parallelism 

4.1.  Problem  Theories 

We briefly review some basic concepts from algebra and logic. A theory is a structure $\langle S, \Sigma, A \rangle$
consisting of a set of sort symbols $S$, operations over those sorts $\Sigma$, and axioms $A$ to constrain the
meaning of the operations. A theory morphism (theory interpretation) maps from the sorts and
operations of one theory to the sorts and expressions over the operations of another theory such
that the image of each source theory axiom is valid in the target theory. A parameterized theory
has formal parameters that are themselves theories [10]. The binding of actual values to formal
parameters is accomplished by a theory morphism. Theory $T_2 = \langle S_2, \Sigma_2, A_2 \rangle$ extends (or is an
extension of) theory $T_1 = \langle S_1, \Sigma_1, A_1 \rangle$ if $S_1 \subseteq S_2$, $\Sigma_1 \subseteq \Sigma_2$, and $A_1 \subseteq A_2$.

Problem  theories  define  a  problem  by  specifying  a  domain  of  problem  instances  or  inputs  and  the 
notion  of  what  constitutes  a  solution  to  a  given  problem  instance.  Formally,  a  problem  theory  B 
has  the  following  structure. 


Sorts       $D$, $R$

Operations  $I : D \to Boolean$

            $O : D \times R \to Boolean$


The input condition $I(x)$ constrains the input domain $D$. The output condition $O(x, z)$ describes
the conditions under which output domain value $z \in R$ is a feasible solution with respect to input
$x \in D$. Theories of booleans and sets are implicitly imported. Problems of finding optimal feasible
solutions can be treated as extensions of problem theory by adding a cost domain, cost function,
and ordering on the cost domain.
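
As an illustration only, a problem theory can be sketched as a small data structure holding the two predicates. The class name `ProblemTheory`, the helper `ordered_permutation`, and the sorting instance are assumptions introduced here for the example; they are not part of KIDS or Specware.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

D = TypeVar("D")  # sort of problem inputs
R = TypeVar("R")  # sort of problem outputs


@dataclass(frozen=True)
class ProblemTheory(Generic[D, R]):
    """Sorts D and R with an input condition I and an output condition O."""

    input_condition: Callable[[D], bool]       # I : D -> Boolean
    output_condition: Callable[[D, R], bool]   # O : D x R -> Boolean

    def is_feasible(self, x: D, z: R) -> bool:
        """True when x is a legal input and z is a feasible solution for x."""
        return self.input_condition(x) and self.output_condition(x, z)


def ordered_permutation(x: list, z: list) -> bool:
    """O(x, z) for sorting: z is ordered and has the same elements as x."""
    is_ordered = all(a <= b for a, b in zip(z, z[1:]))
    return is_ordered and Counter(x) == Counter(z)


# Hypothetical instance: sorting lists of integers (every list is a legal input).
SORTING = ProblemTheory(input_condition=lambda x: True,
                        output_condition=ordered_permutation)
```

For example, `SORTING.is_feasible([3, 1, 2], [1, 2, 3])` holds, while `[2, 1, 3]` is rejected because it is not ordered. The sketch only checks candidate solutions; it says nothing about how a solution is computed, which is the role of the algorithm theories described next.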


4.2.  Algorithm  Theories 

An algorithm theory represents the essential structure of a certain class of algorithms A [25].
Algorithm theory A extends problem theory B with any additional sorts, operators, and axioms needed
to support the correct construction of an A algorithm for B. A theory morphism from the
algorithm theory into some problem domain theory provides the problem-specific concepts needed to
construct an instance of an A algorithm.

For  example,  global  search  theory  (presented  below  in  Section  5.3.2.)  extends  problem  theory  with 
the  basic  concepts  of  backtracking:  subspace  descriptors,  initial  space,  the  splitting  and  extraction 
operations,  filters,  and  so  on.  A  divide-and-conquer  theory  would  extend  problem  theory  with 
concepts  such  as  decomposition  operators  and  composition  operators  [18,  21]. 

The  key  observation  in  this  context  is  that  these  basic  algorithmic  concepts  are  naturally  parallel. 
Divide-and-conquer  algorithms  work  by  decomposing  a  hard  problem  instance  into  subproblem 
instances  that  can  be  solved  independently  (and  thus  in  parallel).  Global  search  algorithms  work 
by  exploring  a  tree  of  alternative  solution  paths  -  again  the  tree  search  can  be  easily  partitioned  over 
various  processors.  Other  algorithm  paradigms  that  we  have  studied  are  also  naturally  parallel. 
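
The following is a minimal sketch of why the divide-and-conquer scheme is naturally parallel: once an instance is decomposed, the subproblem instances share no state and can be handed to separate workers. The function and parameter names (`solve_parallel`, `is_primitive`, `decompose`, `compose`) are assumptions made for this example, and merge sort is used only as a familiar instance. Threads are used for brevity; in CPython they illustrate the structure rather than deliver a speedup on CPU-bound work.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def solve_sequential(x, is_primitive, solve_directly, decompose, compose):
    """Generic divide-and-conquer scheme, evaluated sequentially."""
    if is_primitive(x):
        return solve_directly(x)
    parts = decompose(x)
    return compose([solve_sequential(p, is_primitive, solve_directly,
                                     decompose, compose) for p in parts])


def solve_parallel(x, is_primitive, solve_directly, decompose, compose,
                   max_workers=2):
    """Same scheme, but the top-level subproblems are solved concurrently.

    Only the first decomposition is parallelized, so worker threads never
    wait on other tasks submitted to the same pool.
    """
    if is_primitive(x):
        return solve_directly(x)
    parts = decompose(x)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(solve_sequential, p, is_primitive,
                               solve_directly, decompose, compose)
                   for p in parts]
        results = [f.result() for f in futures]
    return compose(results)


def merge_sort(xs):
    """Merge sort expressed as an instance of the scheme above."""
    return solve_parallel(
        list(xs),
        is_primitive=lambda x: len(x) <= 1,
        solve_directly=lambda x: x,
        decompose=lambda x: [x[:len(x) // 2], x[len(x) // 2:]],
        compose=lambda parts: list(heapq.merge(*parts)),
    )
```

The four operators passed to the scheme play the role of the problem-specific concepts that a theory morphism would supply; the scheme itself is independent of sorting.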

5.  Scheduling  Parallel  Tasks  with  Dependencies  in  Hard-Real-Time 
Systems 

5.1.  Introduction 

In  this  section  we  consider  the  problem  of  scheduling  real-time  tasks  to  run  in  parallel  on  a  pool  of 
processors.  In  particular,  we  focus  on  the  class  of  such  problems  encountered  in  “hard-real-time” 
systems;  that  is,  those  in  which  each  given  task  has  a  completion  deadline  that  must  be  stringently 
enforced  if  the  system  is  to  function  correctly.  One  example  of  such  a  “hard-real-time”  system 
is  given  by  the  operational  flight  program  of  an  avionics  computer  with  multiple  CPUs,  which 
must  perform  such  tasks  as  measuring  airspeed  and  altitude  periodically.  This  is  in  contrast  to 
the  type  of  scheduling  done,  for  instance,  by  the  task  dispatcher  of  a  multiprocessor  operating 
system,  where  the  arrival  times  and  durations  of  tasks  are  unpredictable,  and  where  there  may  be 
inter-task  dependencies  but  no  real-time  deadlines.  Our  derived  algorithm  performs  “pre-run-time” 
scheduling,  in  which  all  the  tasks  that  must  be  executed  within  a  given  time  frame  are  known  in 
advance,  together  with  the  amount  of  execution  time  that  will  be  required  by  each  task  (or  at 
least an upper bound if this time cannot be predicted exactly). It is common practice to use
run-time (dispatch) scheduling algorithms for such problems; however, as Xu and Parnas note in their
comparison  of  various  approaches  to  this  problem, 

[f]or satisfying timing constraints in hard-real-time systems, predictability of the
system’s behaviour is the most important concern; pre-run-time scheduling is often the
only practical means of providing predictability in a complex system. [29]


Scheduling  problems  of  this  type  are  typically  NP-hard  ([9],  A5.2).  A  wide  variety  of  strategies 
have  been  used  in  algorithms  for  such  problems,  including  branch-and-bound  search  [28],  heuristic 
algorithms  [30],  approximation  methods  [11],  and  constraint-based  temporal  reasoning  [3]. 

Our  approach  was  to  treat  this  problem  as  a  special  case  of  a  more  general  class  of  problems:  namely, 
that  of  scheduling  a  type  of  resources  that  we  call  Asynchronously  Shared  Resources  (ASRs).  An 
ASR  is  a  resource  that  can  be  shared  simultaneously  by  many  users  or  tasks  whose  usage  patterns 
are  not  necessarily  synchronized.  The  resource  is  assumed  to  have  finite  capacity  and  the  tasks 
are assumed to use the resource for finite periods. A typical example of an ASR is an automobile
parking  lot  having  n  parking  slots.  Users  can  come  and  go  independently,  but  a  scheduler  should 
never  assign  more  than  n  users  to  the  parking  lot  at  the  same  time.  More  generally  any  pool 
of  individual  resources  can  be  treated  as  an  ASR:  ramp  space  at  airports,  machining  tools  in 
manufacturing, computer processors running in parallel, fleets of transportation vehicles, personnel
in  a  skill  pool,  etc.  Note  that  the  concept  of  an  ASR  is  more  general  than  that  of  a  pool  of  resources 
typically  considered  in  scheduling  problems:  for  instance,  power  sources  (e.g.  generators,  batteries) 
provide  examples  of  nondiscrete  ASRs. 

There are several novel aspects to this work. To our knowledge we present the first solution to
ASR scheduling that uses constraint propagation over time windows/bounds (however, see [15] for
a similar solution to the special case of a single/unshared resource). We present several solutions,
experimental data, and comparative analysis. We found that both the data structures necessary to
support our approach and the propagation control mechanisms are complex. Finally,
the scheduling and constraint propagation algorithms were machine generated.


5.2. The Parallel Task Scheduling Problem

5.2.1. Homogeneous Processors

Suppose  we  are  given  a  set  of  tasks  to  be  scheduled  for  execution  on  a  pool  of  homogeneous 
processors  (that  is,  they  are  identical  as  far  as  characteristics  which  might  affect  the  schedule  are 
concerned).  For  each  task  we  are  given  an  earliest  start  time,  a  latest  start  time,  a  duration,  and 
a demand. If tsk denotes a task, then let tsk.est, tsk.lst, tsk.dur, tsk.demand denote the earliest
start time, latest start time, duration, and demand, respectively, of tsk. Furthermore, let tsk.eft
and tsk.lft denote the corresponding finish times (where tsk.eft = tsk.est + tsk.dur, and so on). If
we  regard  the  pool  of  processors  as  an  Asynchronously  Shared  Resource,  the  scheduling  problem 
can  be  formulated  as  follows:  given 

1.  a  set  T  of  tasks 

2. a precedence relation $\prec$ over T (a partial order)

3.  an  ASR  with  capacity  c 

find  an  assignment  of  start  times  to  each  task  that  satisfies 

1. Precedence Constraint - whenever task $a$ precedes $b$ (written $a \prec b$) then $a.ft \le b.st$

2. ASR Capacity Constraint - at no time does the demand on the ASR exceed its capacity; i.e.

$$\forall (t : time)\;\; demand(T, t) \le c$$

where

$$demand(T, t) = \sum_{\substack{tsk \in T \\ t \in [tsk.st,\, tsk.ft)}} tsk.demand$$

computes the aggregate or net demand of the tasks in T at time t.

The  ASR  scheduling  problem  is  easily  formulated  as  a  constraint  satisfaction  problem:  the  variables 
are  the  start  times  of  the  given  tasks,  and  the  precedence  and  capacity  constraints  restrict  the 
possible  combinations  of  start  times  that  the  tasks  can  be  assigned. 

Figure 1: Demand Maps


A  schedule  is  an  assignment  of  start  times  to  tasks.  Let  tsk.st,  tsk.ft  denote  the  start  time  and 
finish  time  respectively  of  task  tsk.  A  schedule  is  extractable  if  the  start  time  for  each  task  falls 
within  the  task’s  est-lst  window.  An  extractable  schedule  is  feasible  if  it  satisfies  the  precedence 
and  capacity  constraints.  An  extremal  extractable  schedule  is  obtained  by  systematically  choosing 
the  start  time  for  each  task  to  be  its  earliest  start  time  (or  dually  latest  start  time). 
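
The definitions above can be sketched as a small checker. This is an illustrative sketch, not the synthesized scheduler: the `Task` record, the function names, and the three-task instance in the usage note are assumptions made for the example, and the precedence test is taken to be the non-strict comparison a.ft <= b.st.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    est: int         # earliest start time
    lst: int         # latest start time
    dur: int         # duration
    demand: int = 1  # amount of the ASR used while the task runs


def demand_at(tasks, schedule, t):
    """Aggregate demand at time t: tasks whose interval [st, ft) contains t."""
    return sum(tk.demand for tk in tasks
               if schedule[tk.name] <= t < schedule[tk.name] + tk.dur)


def is_extractable(tasks, schedule):
    """Every start time lies within the task's est-lst window."""
    return all(tk.est <= schedule[tk.name] <= tk.lst for tk in tasks)


def is_feasible(tasks, schedule, precedes, capacity):
    """Extractable, and satisfies the precedence and capacity constraints.

    `precedes` is a collection of (a, b) name pairs meaning a precedes b.
    """
    if not is_extractable(tasks, schedule):
        return False
    dur = {tk.name: tk.dur for tk in tasks}
    if any(schedule[a] + dur[a] > schedule[b] for a, b in precedes):
        return False
    # Demand only increases at a start time, so its maximum over all time
    # is attained at the start time of some task.
    return all(demand_at(tasks, schedule, schedule[tk.name]) <= capacity
               for tk in tasks)


def earliest_schedule(tasks):
    """Extremal extractable schedule: every task starts at its est."""
    return {tk.name: tk.est for tk in tasks}
```

With hypothetical tasks `X = Task("X", 0, 2, 3)`, `Y = Task("Y", 0, 5, 2)`, `Z = Task("Z", 1, 6, 2)`, the precedence pair `("X", "Z")` and capacity 2, the schedule `{"X": 0, "Y": 0, "Z": 3}` is feasible, whereas the earliest-start schedule is extractable but not feasible because Z would start before X finishes.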

In the parallel task scheduling problem, the capacity c is integral; and we will assume that task
demands are unit (i.e., each task requires only one processor to execute). (Except for the strategy
discussed in Section 5.4.1., our results readily generalize to nonintegral c and nonunit task demands.)

The aggregate demand that a set of tasks imposes on an ASR is defined in terms of actual start
times for the tasks. However, during the scheduling process we only have bounds on the start
times of tasks. Correspondingly, we can define bounds on the aggregate demand which we call the
definite and possible demands:

$$\text{definite-demand}(T, t) = \sum_{\substack{tsk \in T \\ t \in [tsk.lst,\, tsk.eft)}} tsk.demand$$

$$\text{possible-demand}(T, t) = \sum_{\substack{tsk \in T \\ t \in [tsk.est,\, tsk.lft)}} tsk.demand$$

For any schedule that is extractable from a set of tasks T and any time t we have

$$\text{definite-demand}(T, t) \le demand(T, t) \le \text{possible-demand}(T, t).$$

For  example,  the  following  table  specifies  a  set  of  four  tasks  (with  no  precedences)  and  the  last 
column  gives  a  feasible  schedule  for  c  =  2. 


task   est   lst   duration   st

A      1     4     6          1
B      4     6     5          4
C      2     8     9          7
D      9     12    5          9

The  corresponding  demand  maps  are  shown  in  Figure  1. 
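
As a check on the example, the sketch below recomputes the three demand functions from the table, assuming unit demands as stated above. The function names and the choice to sample integer times from 0 to 19 are assumptions of this sketch; since every bound in the table is an integer, the demand functions can only change at integer times.

```python
# name: (est, lst, duration, st), copied from the table above
TASKS = {
    "A": (1, 4, 6, 1),
    "B": (4, 6, 5, 4),
    "C": (2, 8, 9, 7),
    "D": (9, 12, 5, 9),
}


def demand(t):
    """Tasks actually running at t: t in [st, ft) with ft = st + duration."""
    return sum(1 for est, lst, dur, st in TASKS.values()
               if st <= t < st + dur)


def definite_demand(t):
    """Tasks that must be running at t in any extractable schedule:
    t in [lst, eft) with eft = est + duration."""
    return sum(1 for est, lst, dur, st in TASKS.values()
               if lst <= t < est + dur)


def possible_demand(t):
    """Tasks that could be running at t in some extractable schedule:
    t in [est, lft) with lft = lst + duration."""
    return sum(1 for est, lst, dur, st in TASKS.values()
               if est <= t < lst + dur)


TIMES = range(0, 20)  # sampling horizon chosen for this sketch
BOUNDS_HOLD = all(definite_demand(t) <= demand(t) <= possible_demand(t)
                  for t in TIMES)
PEAK_DEMAND = max(demand(t) for t in TIMES)
```

Under these assumptions the peak demand of the listed schedule is 2, which matches the stated capacity c = 2, and the definite and possible demands bracket the actual demand at every sampled time.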

5.2.2. Heterogeneous Processors


In  the  problem  considered  in  the  previous  section,  we  assumed  that  the  pool  of  processors  was 
homogeneous;  in  particular,  that  the  duration  of  a  task  is  independent  of  the  processor  on  which 
it  is  executed.  In  this  section  we  consider  the  problem  in  the  case  of  heterogeneous  processors, 
where  the  time  actually  required  to  execute  a  task  may  depend  on  which  processor  it  is  scheduled 
to  run.  Thus  the  problem  data  must  now  include,  for  each  processor  in  the  pool,  the  speed  of  that 
processor;  and  for  each  task  we  are  given  its  nominal  duration,  that  is,  the  time  it  would  take  to 
execute  that  task  on  a  processor  of  speed  1. 

As before, for a task tsk, we let tsk.est, tsk.lft, and tsk.dur denote the earliest start time, latest finish time, and nominal duration, respectively, of tsk. In addition, each processor proc in the pool has a speed proc.speed. The scheduling problem for parallel tasks on heterogeneous processors can then be formulated as follows: given


1.  a  set  T  of  tasks 

2. a precedence relation $\prec$ over T (a partial order)

3.  a  set  P  of  k  processors 


find an assignment of processors and start times to tasks (that is, tsk.st and tsk.proc for each task tsk) that satisfies

1. Precedence Constraint - whenever task a precedes b (written $a \prec b$) then

\[ a.ft \le b.st \]

where

\[ a.ft = a.st + (a.dur / a.proc.speed) \]

2. ASR Capacity Constraint - at no time does the number of tasks being executed exceed the number of processors; i.e.

\[ \forall (t : time)\ \ active(T,t) \le c \]

where

\[ active(T,t) = size(\{ tsk \in T : t \in [tsk.st,\ tsk.ft] \}) \]

is the number of tasks from T that are scheduled to execute at time t;

3. Time Window Constraint - for each task a

\[ a.est \le a.st \quad \text{and} \quad a.ft \le a.lft \]


The  following  is  an  example  of  a  parallel  task  scheduling  problem  with  heterogeneous  processors. 
As  in  the  earlier  example  of  scheduling  homogeneous  processors,  we  assume  that  there  are  two 
processors  to  be  scheduled  (which  we  label  1  and  2);  but  in  this  example  the  processors  will  have 
unequal  speeds,  1.4  and  0.6,  respectively.  If  we  redefine  the  capacity  c  to  be  the  sum  of  the  speeds 
of the individual processors, we have k = 2 and c = 2.0. The following table shows the tasks from the previous example (with latest finish times instead of latest start times); the rightmost two columns give a feasible schedule for the execution of these tasks on this set of processors. Note
that  a  solution  now  requires  the  assignment  of  an  individual  processor  as  well  as  a  starting  time  to 
each  task. 


task   est   lft   duration   st   proc
A      1     10    6          1    1
B      4     11    5          6    1
C      2     17    9          2    2
D      9     17    5          10   1

Processor heterogeneity adds significantly to the complexity of the scheduling problem. Note, for instance, that there is no solution to this problem in which task A is assigned to processor 2, for then the actual duration of A is 6/0.6 = 10, which exceeds the size of the time window for A, A.lft - A.est = 9. For another example, consider the same set of tasks, but with processor speeds of 1.5 and 0.5. Even though we still have k = 2 and c = 2.0, no feasible solution exists in this case.
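The feasibility of the tabulated schedule can be checked mechanically. The following Python sketch is an illustration under stated assumptions, not the synthesized scheduler: the names `TASKS`, `SCHEDULE`, `finish_time`, and `is_feasible` are hypothetical, the example has no precedences, and it is assumed that a processor executes one task at a time. Exact rational arithmetic avoids rounding issues with speeds such as 0.6.

```python
from fractions import Fraction

# Processor speeds 1.4 and 0.6 as exact rationals
SPEED = {1: Fraction(14, 10), 2: Fraction(6, 10)}

# task -> (est, lft, nominal duration), from the table above
TASKS = {"A": (1, 10, 6), "B": (4, 11, 5), "C": (2, 17, 9), "D": (9, 17, 5)}

# task -> (start time, processor), the rightmost two columns of the table
SCHEDULE = {"A": (1, 1), "B": (6, 1), "C": (2, 2), "D": (10, 1)}


def finish_time(task, st, proc, speed=SPEED):
    """ft = st + dur / speed of the assigned processor."""
    _, _, dur = TASKS[task]
    return st + Fraction(dur) / speed[proc]


def is_feasible(schedule, speed=SPEED):
    """Check the time window constraint and non-overlap on each processor."""
    by_proc = {}
    for task, (st, proc) in schedule.items():
        est, lft, _ = TASKS[task]
        ft = finish_time(task, st, proc, speed)
        if st < est or ft > lft:          # time window constraint
            return False
        by_proc.setdefault(proc, []).append((st, ft))
    for intervals in by_proc.values():
        intervals.sort()
        for (_, ft_prev), (st_next, _) in zip(intervals, intervals[1:]):
            if ft_prev > st_next:         # two tasks overlap on one processor
                return False
    return True
```

Calling `is_feasible(SCHEDULE)` checks the schedule in the table, and `is_feasible({**SCHEDULE, "A": (1, 2)})` illustrates the remark about assigning task A to processor 2.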

The  algorithms  that  were  synthesized  for  solving  these  problems  utilized  several  different  search 
strategies.  All  were  derived  as  particular  instances  of  the  general  theory  of  global  search  algorithms. 
Before  discussing  the  problem-specific  strategies,  we  turn  to  a  description  of  the  general  theory. 

The  problem  of  finding  feasible  schedules  for  a  set  of  tasks  assigned  to  a  set  of  heterogeneous 
processors  can  be  presented  as  a  problem  theory  via  a  theory  interpretation  into  the  domain  theory 
of  ASR  scheduling:^ 

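\[
\begin{array}{lcl}
D & \mapsto & set(task) \times seq(processor) \\
I & \mapsto & \lambda(tasks, processors)\ \ true \\
R & \mapsto & map(processor, seq(reservation)) \\
O & \mapsto & \lambda(tasks, processors, sched) \\
  &         & \quad \textit{Consistent-Task-Processor}(sched) \\
  &         & \quad \wedge\ \textit{Consistent-Earliest-Start-Times}(sched) \\
  &         & \quad \wedge\ \textit{Consistent-Latest-Finish-Times}(sched) \\
  &         & \quad \wedge\ \textit{Consistent-Task-Predecessors}(sched) \\
  &         & \quad \wedge\ \textit{Consistent-Task-Successors}(sched) \\
  &         & \quad \wedge\ \textit{Available-Processors-Used}(processors, sched) \\
  &         & \quad \wedge\ \textit{Scheduled-Tasks}(sched) = \textit{seq-to-set}(tasks)
\end{array}
\]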

5.3.  Theory  of  Global  Search  Algorithms 
5.3.1.  Synthesizing  a  Scheduler 

There  are  two  basic  approaches  to  computing  schedules  of  any  kind:  local  and  global.  Local  methods 
focus  on  individual  schedules  and  similarity  relationships  between  them.  Once  an  initial  schedule  is 
obtained,  it  is  iteratively  improved  by  moving  to  neighboring  structurally  similar  schedules.  Repair 
strategies  [31,  14,  2,  17],  fixpoint  iteration  [5],  and  linear  programming  algorithms  are  examples  of 
local  methods. 

^The  domain  theory  includes  definitions  for  the  types  of  task,  processor,  reservation  (a  record  comprised  of  task, 
assigned  processor,  and  start  time)  and  schedule  (a  map  from  the  set  of  processors  to  sequences  of  reservations). 
Global methods focus on sets of schedules. A feasible or optimal schedule is found by repeatedly
splitting  an  initial  set  of  schedules  into  subsets  until  a  feasible  or  optimal  schedule  can  be  easily 
extracted.  Backtrack,  heuristic  search,  and  branch-and-bound  methods  are  all  examples  of  global 
methods.  For  this  problem,  we  applied  global  methods.  In  the  following  subsections  we  formalize 
the  notion  of  global  search  method  and  show  how  it  can  be  applied  to  synthesize  a  scheduler.  Other 
projects  taking  a  global  approach  include  ISIS  [8],  OPIS/DITOPS  [27],  and  MicroBoss  [16]  (all  at 
CMU). 


5.3.2.  Global  Search  Theory 

The  basic  idea  of  global  search  is  to  represent  and  manipulate  sets  of  candidate  solutions.  The 
principal  operations  are  to  extract  candidate  solutions  from  a  set  and  to  split  a  set  into  subsets. 
Derived  operations  include  various  filters  which  are  used  to  eliminate  sets  containing  no  feasible 
or  optimal  solutions.  Global  search  algorithms  work  as  follows:  starting  from  an  initial  set  that 
contains  all  solutions  to  the  given  problem  instance,  the  algorithm  repeatedly  extracts  solutions, 
splits  sets,  and  eliminates  sets  via  filters  until  no  sets  remain  to  be  split.  The  process  is  often 
described  as  a  tree  (or  DAG)  search  in  which  a  node  represents  a  set  of  candidates  and  an  arc 
represents  the  split  relationship  between  set  and  subset.  The  filters  serve  to  prune  off  branches  of 
the  tree  that  cannot  lead  to  solutions. 
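The extract/split/filter loop described above can be sketched generically. The following Python code is a minimal illustration, not the KIDS program scheme: the function names and the toy subset-sum instance are assumptions chosen here only to make the loop runnable. A descriptor `(chosen, i)` denotes all subsets that extend `chosen` with items at index `i` or later.

```python
def global_search(top, extract, split, keep):
    """Return every solution reachable from the initial descriptor `top`."""
    solutions = []
    frontier = [top]                       # descriptors still to be processed
    while frontier:
        space = frontier.pop()
        solutions.extend(extract(space))   # directly extractable solutions
        # split the space and let the filter prune hopeless subspaces
        frontier.extend(s for s in split(space) if keep(s))
    return solutions


def subset_sum(items, target):
    """Toy instance: all subsets (as tuples) of positive `items` summing to `target`."""
    def extract(space):
        chosen, i = space
        # only fully decided descriptors yield a candidate
        return [chosen] if i == len(items) and sum(chosen) == target else []

    def split(space):
        chosen, i = space
        if i == len(items):
            return []
        # two children: include items[i] or leave it out
        return [(chosen + (items[i],), i + 1), (chosen, i + 1)]

    def keep(space):
        # necessary condition for feasibility, valid because items are positive
        return sum(space[0]) <= target

    return global_search(((), 0), extract, split, keep)
```

Here `keep` plays the role of a filter: it is only a necessary condition, so it never discards a subspace that contains a solution.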

The  sets  of  candidate  solutions  are  often  infinite  and  even  when  finite  they  are  rarely  represented 
extensionally.  Thus  global  search  algorithms  are  based  on  an  abstract  data  type  of  intensional 
representations called space descriptors (denoted by hatted symbols). In addition to the extraction and splitting operations mentioned above, the type also includes a predicate Satisfies that
determines  when  a  candidate  solution  is  in  the  set  denoted  by  a  descriptor.  Further,  there  is  a 
refinement  relation  on  spaces  that  corresponds  to  the  subset  relation  on  the  sets  denoted  by  a  pair 
of  descriptors. 

The various operations in the abstract data type of space descriptors together with the problem specification can be packaged together as a theory. Formally, abstract global search theory (or simply gs-theory) $\mathcal{G}$ is presented in Figure 2, where $D$ is the input domain, $R$ is the output domain, $I$ is the input condition, $O$ is the output condition, $\hat{R}$ is the type of space descriptors, $\hat{I}$ defines legal space descriptors, $\hat{r}$ and $\hat{s}$ vary over descriptors, and $top(x)$ is the descriptor of the initial set of candidate solutions. $\textit{Satisfies}(z,\hat{r})$ means that $z$ is in the set denoted by descriptor $\hat{r}$ or that $z$ satisfies the constraints that $\hat{r}$ represents, and $\textit{Extract}(z,\hat{r})$ means that $z$ is directly extractable from $\hat{r}$.

The relations Split-Arg and Split-Constraint are used to determine and perform splitting. In particular, if $\textit{Split-Arg}(x,\hat{r},c)$ then $c$ is information that characterizes (or informs) one branch of the split. $\textit{Split-Constraint}(x,\hat{r},c,\hat{s})$ means that $\hat{s}$ results from incorporating information $c$ into the descriptor $\hat{r}$ (with respect to input $x$). Split-Arg is used to control the generation of children of a node in the search tree and Split-Constraint is used to specify one child. Split-Constraint can be thought of as a parameterized constraint whose alternative arguments are supplied by Split-Arg.

The refinement relation $\hat{r} \sqsupseteq \hat{s}$ holds when $\hat{s}$ denotes a subset of the set denoted by $\hat{r}$. Further, $\hat{R}$ together with $\sqsupseteq$ forms a bounded semilattice. This structure will play a crucial role in constraint propagation algorithms.

Note that all variables in the axioms are assumed to be universally quantified unless explicitly specified otherwise. Axiom GS0 asserts that the initial descriptor $top(x)$ is a legal descriptor.
Global Search Theory

Spec Global-Search

  Sorts
    $D$          input domain
    $R$          output domain
    $\hat{R}$    subspace descriptors
    $C$          splitting information

  Operations
    $I : D \to boolean$                                                     input condition
    $O : D \times R \to boolean$                                            input/output condition
    $\hat{I} : D \times \hat{R} \to boolean$                                subspace descriptors condition
    $\textit{Satisfies} : R \times \hat{R} \to boolean$                     denotation of descriptors
    $\textit{Split-Arg} : D \times \hat{R} \times C \to boolean$            specifies arguments to split constraint
    $\textit{Split-Constraint} : D \times \hat{R} \times C \times \hat{R} \to boolean$   parameterized splitting constraint
    $\textit{Extract} : R \times \hat{R} \to boolean$                       extractor of solutions from spaces
    $\Psi : D \times R \times \hat{R} \to boolean$                          cutting constraint
    $\xi : D \times \hat{R} \to boolean$                                    cutting constraint
    $\sqsupseteq\ : D \times \hat{R} \times \hat{R} \to boolean$            refinement relation
    $top : D \to \hat{R}$                                                   initial space
    $bot : \hat{R}$                                                         inconsistent space

  Axioms

    GS0. All feasible solutions are in the top space
      $I(x) \wedge O(x,z) \implies \textit{Satisfies}(z, top(x))$

    GS1. All solutions in a space are finitely extractable
      $I(x) \wedge \hat{I}(x,\hat{r}) \implies (\textit{Satisfies}(z,\hat{r}) \iff \exists(\hat{s})\,(\textit{Split}^{*}(x,\hat{r},\hat{s}) \wedge \textit{Extract}(z,\hat{s})))$

    GS2. Specification of Cutting Constraint
      $\textit{Satisfies}(z,\hat{r}) \wedge O(x,z) \implies \Psi(x,z,\hat{r})$

    GS3. Definition of Cutting Constraint on Spaces
      $\xi(x,\hat{r}) \iff \forall(z : R)\,(\textit{Satisfies}(z,\hat{r}) \implies \Psi(x,z,\hat{r}))$

    GS4. Definition of Refinement
      $\hat{r} \sqsupseteq \hat{s} \iff \forall(z : R)\,(\textit{Satisfies}(z,\hat{s}) \implies \textit{Satisfies}(z,\hat{r}))$

    GS5. $\langle \hat{R}, \sqsupseteq, \sqcap, top, bot \rangle$ is a bounded meet-semilattice with bot as universal lower bound.

end spec

Figure 2: Global Search Theory


Axiom GS1 asserts that legal descriptors split into legal descriptors and that Split induces a well-founded ordering on spaces. Axiom GS2 constrains the denotation of the initial descriptor: all feasible solutions are contained in the initial space. Axiom GS3 gives the denotation of an arbitrary descriptor $\hat{r}$: an output object $z$ is in the set denoted by $\hat{r}$ if and only if $z$ can be extracted after finitely many applications of Split to $\hat{r}$, where

\[ \textit{Split}^{*}(x,\hat{r},\hat{s}) \iff \exists(k : Nat)\ \textit{Split}^{k}(x,\hat{r},\hat{s}) \]

and

\[ \textit{Split}^{0}(x,\hat{r},\hat{t}) \iff \hat{r} = \hat{t} \]

and for all natural numbers $k$

\[ \textit{Split}^{k+1}(x,\hat{r},\hat{t}) \iff \exists(\hat{s} : \hat{R},\ i : C)\,(\textit{Split-Arg}(x,\hat{r},i) \wedge \textit{Split-Constraint}(x,\hat{r},i,\hat{s}) \wedge \textit{Split}^{k}(x,\hat{s},\hat{t})). \]

Axiom GS4 asserts that if $\hat{r}$ splits to $\hat{s}$ then $\hat{r}$ also refines to $\hat{s}$; thus the refinement relation on $\hat{R}$ is weaker than the split relation. We also need the axioms that $\langle \hat{R}, \sqsupseteq, \sqcap \rangle$ is a semilattice. For simplicity, we write $\hat{r} \sqsupseteq \hat{s}$ rather than the correct $\sqsupseteq(x,\hat{r},\hat{s})$; and similarly $\hat{r} \sqcap \hat{s}$.

For  example,  a  simple  global  search  theory  of  parallel  task  scheduling  (homogeneous  case)  has 
the  following  form.  Schedules  are  represented  as  maps  from  processors  to  sequences  of  reservations 
(where  each  reservation  includes  a  task,  earliest-start-time,  latest-finish-time,  and  actual  start  time). 
The  type  of  schedules  has  the  invariant  (or  subtype  characteristic)  that  for  each  reservation,  the 
earliest-start-time  plus  the  task  duration  is  no  later  than  the  latest-finish-time.  A  partial  schedule 
is  a  schedule  over  a  subset  of  the  given  tasks. 

The initial (partial) schedule is just the empty schedule - a map from the available processors to the empty sequence of reservations. A partial schedule is extended by first selecting a task, task, to schedule, and then selecting a processor, proc. The tuple (task, proc) constitutes the information c of Split-Arg. Split-Constraint, given (task, proc), creates an extended schedule that has a reservation for task added to the sequence of reservations currently scheduled on proc. The alternative ways that a partial schedule can be extended naturally give rise to the branching structure underlying global search algorithms.

The  formal  version  of  this  global  search  theory  of  scheduling  can  be  inspected  in  the  domain  theory 
in  Appendix  C. 
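The branching structure just described can be illustrated with a small depth-first search. The Python sketch below is an assumption-laden illustration, not the formal theory of Appendix C: the name `schedule`, the data layout, and the choice of always starting a newly appended task as early as possible are made up for this example, and precedence constraints are ignored. A partial schedule maps each processor to a list of `(task, st, ft)` reservations; a split argument is a pair (task, proc).

```python
from fractions import Fraction

# task -> (est, lft, nominal duration), taken from the heterogeneous example
TASKS = {"A": (1, 10, 6), "B": (4, 11, 5), "C": (2, 17, 9), "D": (9, 17, 5)}


def schedule(tasks, speeds):
    """Depth-first global search over partial schedules.

    Returns proc -> [(task, st, ft), ...] or None if no schedule is found.
    """
    def extend(partial, remaining):
        if not remaining:                      # every task has a reservation
            return partial
        for task in sorted(remaining):         # choose the split argument
            est, lft, dur = tasks[task]
            for proc in sorted(speeds):        # ... (task, proc)
                seq = partial[proc]
                # earliest start: after est and after the previous reservation
                st = max(Fraction(est), seq[-1][2]) if seq else Fraction(est)
                ft = st + Fraction(dur) / speeds[proc]
                if ft > lft:                   # time window violated: prune
                    continue
                child = dict(partial)
                child[proc] = seq + [(task, st, ft)]
                found = extend(child, remaining - {task})
                if found is not None:
                    return found
        return None

    top = {proc: [] for proc in speeds}        # the empty schedule
    return extend(top, frozenset(tasks))
```

With speeds `{1: Fraction(7, 5), 2: Fraction(3, 5)}` (that is, 1.4 and 0.6) the search returns a schedule, while with `{1: Fraction(3, 2), 2: Fraction(1, 2)}` it returns `None`.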


5.3.3.  Pruning  Mechanisms 

When  a  partial  schedule  is  extended  it  is  possible  that  some  problem  constraints  are  violated  in 
such  a  way  that  further  extension  to  a  complete  feasible  schedule  is  impossible.  In  tree  search 
algorithms  it  is  crucial  to  detect  such  violations  as  early  as  possible. 

Pruning tests are derived in the following way. The test

\[ \exists(z)\,(\textit{Satisfies}(z,\hat{r}) \wedge O(x,z)) \qquad (1) \]

decides whether there exist any feasible solutions that are in the space denoted by $\hat{r}$. If we could decide this at each node of our branching structure then we would have perfect search - no dead-end branches would ever be explored. In practice it would be impossible or horribly complex to compute (1), so we rely instead on an inexpensive approximation to it. In fact, if we approximate (1) by weakening it (deriving a necessary condition of it) we obtain a sound pruning test. That is, suppose we can derive a test $\Phi(x,\hat{r})$ such that

\[ \exists(z)\,(\textit{Satisfies}(z,\hat{r}) \wedge O(x,z)) \implies \Phi(x,\hat{r}) \qquad (2) \]

By the contrapositive of (2), if $\neg\Phi(x,\hat{r})$ then there are no feasible solutions in $\hat{r}$, so we can eliminate it from further processing. A global search algorithm will test $\Phi$ at each node it explores, pruning those nodes where the test fails.

More generally, necessary conditions on the existence of feasible (or optimal) solutions below a node in a branching structure underlie pruning in backtracking and the bounding and dominance tests of branch-and-bound algorithms [19].

It appears that the bottleneck analysis advocated in the constraint-directed search projects at CMU [7, 16] leads to a semantic approximation to (1), but neither a necessary nor sufficient condition. Such a heuristic evaluation of a node is inherently fallible, but if the approximation is close enough it can provide good search control with relatively little backtracking.

To derive pruning tests for the ASR scheduling problem, we instantiate (1) with our definition of Satisfies and O and use an inference system to derive necessary conditions. The resulting tests are fairly straightforward; of the 7 original feasibility constraints, 5 yield pruning tests on partial schedules. For example, the partial schedule must satisfy Consistent-Task-Processor, Consistent-Separation-on-Processor-EST, Consistent-Separation-on-Processor-LFT, Consistent-Separation-Predecessors, Consistent-Separation-Scheduled-Successors, Consistent-Separation-Scheduled-to-Unscheduled-Successors, and Consistent-Separation-Unscheduled-to-Unscheduled-Successors. The reader may note that computing these tests on partial schedules is rather expensive and mostly unnecessary; however, later program optimization steps reduce these tests to fast and irredundant form. For example, the second test will reduce to checking that, when we assign a task to a processor, the earliest start time of the newly assigned task is consistent with the earliest start time of the task on the same processor that immediately precedes it.

For  details  of  deriving  pruning  mechanisms  for  other  problems  see  [19,  24,  25,  20]. 


5.3.4.  Cutting  Constraints  and  Constraint  Propagation 

Constraint  propagation  is  a  more  general  technique  that  is  crucial  for  early  detection  of  infeasibility. 
We  developed  a  general  mechanism  for  deriving  constraint  propagation  code  and  applied  it  to 
scheduling. 

Each  node  in  a  backtrack  tree  can  be  viewed  as  a  data  structure  that  denotes  a  set  of  candidate 
solutions - in particular the solutions that occur in the subtree rooted at the node (see Figure 3).
Thus  the  root  denotes  the  set  of  all  candidate  solutions  found  in  the  tree. 

Pruning  has  the  effect  of  removing  a  node  (set  of  solutions)  from  further  consideration.  In  contrast, 
constraint  propagation  has  the  effect  of  changing  the  space  descriptor  so  that  it  denotes  a  smaller 
set  of  candidate  solutions.  The  effect  of  constraint  propagation  is  to  spread  information  through 
the  subspace  descriptor  resulting  in  a  tighter  descriptor  and  possibly  exposing  infeasibility.  Pruning 
can be treated as a special case of propagation in which a space is refined to a descriptor that denotes the empty set of solutions.

Figure 3: Global Search Subspace and Cutting Constraints


Constraint propagation is based on the notion of cutting constraints, which are necessary conditions $\Psi(x,z,\hat{r})$ that a candidate solution $z$ satisfying $\hat{r}$ is feasible:

\[ \forall(x : D,\ \hat{r} : \hat{R},\ z : R)\,(\textit{Satisfies}(z,\hat{r}) \wedge O(x,z) \implies \Psi(x,z,\hat{r})) \qquad (3) \]

See Figures 3 and 4. In order to get a test on spaces that decides whether $\Psi$ has been incorporated, we make one further definition:

\[ \xi(x,\hat{r}) \iff \forall(z : R)\,(\textit{Satisfies}(z,\hat{r}) \implies \Psi(x,z,\hat{r})) \qquad (4) \]

The test $\xi(x,\hat{r})$ holds exactly when all candidate solutions in $\hat{r}$ satisfy $\Psi$, and we say that $\hat{r}$ satisfies $\Psi$.

The key question at this point is: Given a descriptor $\hat{r}$ that doesn't satisfy $\Psi$, how can we incorporate $\Psi$ into $\hat{r}$? The answer is to find the greatest refinement of $\hat{r}$ that satisfies $\Psi$; we say $\hat{t}$ incorporates $\Psi$ into $\hat{r}$ if

\[ \hat{t} = \max_{\sqsupseteq}\{\hat{s} \mid \hat{r} \sqsupseteq \hat{s} \wedge \xi(x,\hat{s})\} \qquad (5) \]

which asserts that $\hat{t}$ is maximal, with respect to the ordering $\sqsupseteq$, over the set of descriptors that refine $\hat{r}$ and satisfy $\Psi$. We want $\hat{t}$ to be a refinement of $\hat{r}$ so that all of the information in $\hat{r}$ is preserved, and we want $\hat{t}$ to be maximal so that no other information than $\hat{r}$ and $\Psi$ is incorporated into $\hat{t}$.

The next question concerns the conditions under which Formula (5) is satisfiable. Assuming that $\hat{R}$ is a semilattice, we can use variants of Tarski's fixpoint theorem (cf. [5]):

Theorem If there is a function $f$ such that

1. $f$ is monotonic on $\hat{R}$ (i.e. $\hat{s} \sqsupseteq \hat{t} \implies f(x,\hat{s}) \sqsupseteq f(x,\hat{t})$)

2. $f$ is deflationary (i.e. $\hat{r} \sqsupseteq f(x,\hat{r})$)

3. $f$ has fixed-points satisfying $\xi$ (i.e. $f(x,\hat{r}) = \hat{r} \implies \xi(x,\hat{r})$)

then (1) $\hat{t} = \max_{\sqsupseteq}\{\hat{s} \mid \hat{r} \sqsupseteq \hat{s} \wedge \xi(x,\hat{s})\}$ exists

and (2) $\hat{t}$ is the greatest fixpoint of $f$; i.e. $\hat{t}$ can be computed by iteratively applying $f$ to $\hat{r}$ until a fixpoint is reached.

Figure 4: Pruning and Constraint Propagation

The challenge is to construct a monotonic, deflationary function whose fixed-points satisfy $\xi$. A general construction in terms of global search theory can be sketched as follows. Let

\[ f(x,\hat{r}) = \begin{cases} \hat{r} & \text{if } \xi(x,\hat{r}) \\ \ldots & \text{if } \neg\xi(x,\hat{r}) \end{cases} \]

The intent is to define $f$ so that it has fixpoints exactly when $\xi(x,\hat{r})$ holds. When $\xi(x,\hat{r})$ doesn't hold, then we know (by the definition of $\xi$ and the contrapositive of formula (3)) that

\[ \exists(z : R)\,(\textit{Satisfies}(z,\hat{r}) \wedge \neg O(x,z)) \]

i.e. there are some infeasible solutions in the space described by $\hat{r}$. Ideally $\neg\xi(x,\hat{r})$ is a constructive assertion, so it provides information on which solutions are infeasible and how to eliminate them. In place of the ellipsis above we require a new descriptor that refines $\hat{r}$ (so $f$ is decreasing on all inputs), allows $f$ to be monotone, and eliminates some of the infeasible solutions indicated by $\neg\xi(x,\hat{r})$. In general it is difficult to see how to achieve this end without assuming special structure to $\hat{R}$ and $\xi$.

We have identified some special cases for which an analytic procedure can produce the necessary iteration function $f$ from $\xi$. These special cases subsume our scheduling applications and many related Constraint Satisfaction Problems (CSPs). Suppose that the constraint $\xi$ has the form

\[ B(x,\hat{r}) \sqsupseteq \hat{r} \qquad (6) \]

where $B(x,\hat{r})$ is monotonic in $\hat{r}$. We say that $\xi$ is a Horn-like constraint by generalization of Horn clauses in logic. Notice that the occurrence of $\hat{r}$ on the right-hand side of the inequality has positive polarity (i.e. it is monotonic in $\hat{r}$), whereas the occurrence(s) of $\hat{r}$ on the left-hand side have negative polarity (i.e. are antimonotonic). If the constraint were boolean (with $B$ and $\hat{r}$ being boolean values and $\sqsupseteq$ being implication), then this would be called a definite Horn clause. When our constraints are Horn-like, then there is a simple definition for the desired function $f$:

\[ f(x,\hat{r}) = \begin{cases} \hat{r} & \text{if } B(x,\hat{r}) \sqsupseteq \hat{r} \\ B(x,\hat{r}) \sqcap \hat{r} & \text{if } \neg(B(x,\hat{r}) \sqsupseteq \hat{r}) \end{cases} \]

or equivalently

\[ f(x,\hat{r}) = B(x,\hat{r}) \sqcap \hat{r}. \]

It is easy to check that $f$ is monotone in $\hat{r}$, deflationary, and has fixed-points exactly when $\xi$ holds. Therefore, simple iteration of $f$ will converge to the descriptor that incorporates $\xi$ into $\hat{r}$. However, if $\hat{r}$ is an aggregate structure such as a tuple or map, then the changes made at each iteration may be relatively sparse, so the simple iteration approach may be grossly inefficient. We found this feature to be characteristic of scheduling and other CSPs. Our approach to solving this problem is to focus on single point changes and to exploit dependence analysis. For each component of $\hat{r}$ we define a separate change propagation procedure. The arguments to a propagation procedure specify a change to the component. This change is performed and then the change procedures for all other components that could be affected by the change are invoked. Static dependence analysis at design-time is used to determine which constraints could be affected by a change to a given component.
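The simple iteration of f(x, r) = B(x, r) meet r can be illustrated on a small example. The Python sketch below is hypothetical and not taken from the report: descriptors are maps from variables to closed integer intervals, the meet is interval intersection, `None` stands for bot, and the three-task chain a, b, c with durations 3 and 2 for a and b is an invented instance. The variables are start-time windows constrained by st_a + 3 <= st_b and st_b + 2 <= st_c.

```python
INF = float("inf")


def meet(a, b):
    """Intersection of two closed intervals; None plays the role of bot."""
    if a is None or b is None:
        return None
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None


# B_v(r): a bound on variable v computed from the windows of the other variables.
# Each bound shrinks when the windows it reads shrink, so it is monotone in r.
BOUNDS = {
    "a": lambda r: (-INF, r["b"][1] - 3),
    "b": lambda r: (r["a"][0] + 3, r["c"][1] - 2),
    "c": lambda r: (r["b"][0] + 2, INF),
}


def iterate_to_fixpoint(r, bounds=BOUNDS):
    """Apply f(r) = B(r) meet r componentwise until nothing changes."""
    while True:
        s = {v: meet(bounds[v](r), r[v]) for v in r}
        if any(val is None for val in s.values()):
            return None                    # some valset became empty: bot
        if s == r:
            return r                       # fixpoint: every bound holds
        r = s
```

Starting from windows `(0, 10)` for all three variables, the iteration needs two productive passes before it stabilizes, and each pass recomputes every component even though only some of them change, which is the inefficiency noted in the text.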

A program scheme for global search with constraint propagation is presented in Figure 5. The global search design tactic in KIDS first instantiates this scheme, then invokes a tactic for synthesizing propagation code to satisfy the specification F-split-and-propagate.

CSPs  with  Horn-like  constraints 

We now elaborate the previous exposition of propagation of Horn-like constraints arising in CSPs. To keep matters simple, yet general, suppose that the output datatype $R$ is map(VAR, VALSET), where VAR is a type of variables, and VALSET is a type that denotes a set of values (this implies that all the variables have the same type and refinement ordering), and the $\sqsupseteq$ relation has the form:

\[ \hat{r} \sqsupseteq \hat{s} \quad \text{iff} \quad \bigwedge_{v} \hat{r}(v) \sqsupseteq \hat{s}(v). \]
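Global Search Program Theory

Spec Global-Search-Program (T :: Global-Search)

  Operations

    F-initial-propagate ($x : D \mid I(x)$)
      returns ($\hat{t} : \hat{R} \mid \hat{t} = \max_{\sqsupseteq}\{\hat{s} \mid top(x) \sqsupseteq \hat{s} \wedge \hat{I}(x,\hat{s}) \wedge \xi(x,\hat{s})\}$)

    F-split-and-propagate
      ($x : D,\ \hat{r} : \hat{R},\ c : C$
        $\mid I(x) \wedge \hat{I}(x,\hat{r}) \wedge \textit{Split-Arg}(x,\hat{r},c) \wedge \xi(x,\hat{r}) \wedge \hat{r} \neq bot$)
      returns ($\hat{t} : \hat{R} \mid \hat{t} = \max_{\sqsupseteq}\{\hat{s} \mid \hat{r} \sqsupseteq \hat{s} \wedge \hat{I}(x,\hat{s}) \wedge \textit{Split}(x,\hat{r},c,\hat{s}) \wedge \xi(x,\hat{s})\}$)

    F-gs ($x : D,\ \hat{r} : \hat{R} \mid I(x) \wedge \hat{I}(x,\hat{r}) \wedge \Phi(x,\hat{r})$)
      returns ($z : R \mid O(x,z) \wedge \textit{Satisfies}(z,\hat{r})$)
      = if $\exists(z)\,(\textit{Extract}(z,\hat{r}) \wedge O(x,z))$
        then some($z$) $(\textit{Extract}(z,\hat{r}) \wedge O(x,z))$
        else some($z$) $\exists(c : C,\ \hat{t} : \hat{R})$
               $(\textit{Split-Arg}(x,\hat{r},c)$
                $\wedge\ \hat{t} = \textit{F-split-and-propagate}(x,\hat{r},c) \wedge \hat{t} \neq bot$
                $\wedge\ z = \textit{F-gs}(x,\hat{t}))$

    F ($x : D \mid I(x)$)
      returns ($z : R \mid O(x,z)$)
      = some($z$) $\exists(\hat{t})\,(\hat{t} = \textit{F-initial-propagate}(x)$
                $\wedge\ \hat{t} \neq bot$
                $\wedge\ z = \textit{F-gs}(x,\hat{t}))$

end spec

Figure 5: Global Search Program Theory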
Suppose further that $\xi$ is a conjunction of constraints giving bounds on the variables:

\[ \xi(x,\hat{r}) \iff \bigwedge_{v} B_v(x,\hat{r}) \sqsupseteq \hat{r}(v) \]

where $B_v(x,\hat{r})$ is monotonic in $\hat{r}$. Under these assumptions, $\neg\xi(x,\hat{r})$ implies that the bounding constraint on some variable $v$ is violated; i.e.

\[ \neg(B_v(x,\hat{r}) \sqsupseteq \hat{r}(v)). \]

To "fix" such a violation we can change the current valset of $v$ to

\[ B_v(x,\hat{r}) \sqcap \hat{r}(v), \]

which simultaneously refines $\hat{r}(v)$, since

\[ \hat{r}(v) \sqsupseteq B_v(x,\hat{r}) \sqcap \hat{r}(v) \]

and reestablishes the constraint on $v$, since

\[ B_v(x,\hat{r}) \sqsupseteq B_v(x,\hat{r}) \sqcap \hat{r}(v). \]

Let

\[ B(x,\hat{r}) = \{|\ v \to B_v(x,\hat{r}) \sqcap \hat{r}(v) \mid v \in domain(\hat{r})\ |\} \]

then, define $f$ as:

\[ f(x,\hat{r}) = \hat{r} \sqcap B(x,\hat{r}) \]

Constraint propagation is treated here as iteration of $f$ until a fixed-point is reached. Efficiency requires that we go farther, since only a sparse subset of the variables in $\hat{r}$ will be updated at each iteration. If we implemented the iteration on a vector processor or SIMD machine, the overall computation could be fast, but wasteful of processors. On a sequential machine, it is advantageous to analyze the constraints in $\xi$ to infer dependence of constraints on variables. That is, if (the valset of) variable $v$ changes, which constraints in $\xi$ could become violated? This dependence analysis can be used to generate special-purpose propagation code as follows.

For each variable $v$, let affects($v$) be the set of variables whose constraints could be violated by a change in $v$; more formally, let

\[ \textit{affects}(v) = \{ u \mid v \text{ occurs in } B_u \}. \]

We can then generate a set of procedures that carry out the propagation/iteration of $f$: For each variable $v$, generate the following propagation procedure:

    Propagate_v ($x : D,\ \hat{r} : \hat{R},$ new-valset : VALSET
                 $\mid I(x) \wedge \hat{I}(x,\hat{r})$
                 $\wedge\ \hat{r}(v) \sqsupseteq$ new-valset
                 $\wedge\ B_v(x,\hat{r}) \sqsupseteq$ new-valset)
      = let ($\hat{s} : \hat{R}$ = map-shadow($\hat{r}$, $v$, new-valset))
        if $\neg\hat{I}(x,\hat{s})$ then bot
        else
          ... for each variable $u$ in affects($v$) ...
          ... generate the following code block ...
          (if $\hat{s}$ = bot then bot
           else (if $\neg(B_u(x,\hat{s}) \sqsupseteq \hat{s}(u))$
                 then $\hat{s} \leftarrow$ Propagate_u($x$, $\hat{s}$, $B_u(x,\hat{s}) \sqcap \hat{s}(u)$)));
          $\hat{s}$
        end

where map-shadow($\hat{r}$, $v$, new-valset) returns the map $\hat{r}$ modified so that $\hat{r}(v)$ = new-valset.

To finish up, if $\textit{Split}(x,\hat{r},i,\hat{s})$ has the form

\[ \hat{s}(u) = C(x,\hat{r},i) \]

for some function $C$ that yields a refined valset for variable $u$, then we can satisfy F-split-and-propagate as follows:

\[ \textit{F-split-and-propagate}(x,\hat{r},i) = \textit{Propagate}_u(x,\hat{r},C(x,\hat{r},i)). \]

The change to $u$ induced in the call to Propagate_u will in turn trigger changes to other variables, and so on.
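The dependence-driven propagation procedures can be imitated by a single recursive function parameterized by the variable. The Python sketch below is an illustration under assumptions, not the generated KIDS code: the interval representation, the `BOUNDS` and `AFFECTS` tables, and the three-variable chain (st_a + 3 <= st_b, st_b + 2 <= st_c) are invented for the example. `AFFECTS[v]` lists the variables u whose bound B_u mentions v.

```python
INF = float("inf")


def meet(a, b):
    """Intersection of two closed intervals; None plays the role of bot."""
    if a is None or b is None:
        return None
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None


# Bounds B_u for the chain a -> b -> c (hypothetical durations 3 and 2)
BOUNDS = {
    "a": lambda r: (-INF, r["b"][1] - 3),
    "b": lambda r: (r["a"][0] + 3, r["c"][1] - 2),
    "c": lambda r: (r["b"][0] + 2, INF),
}

# affects(v) = variables u such that v occurs in B_u
AFFECTS = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}


def propagate(r, v, new_valset):
    """Set r[v] to new_valset, then repair every bound the change may violate."""
    if new_valset is None:
        return None
    s = dict(r)
    s[v] = new_valset                      # the role of map-shadow
    for u in AFFECTS[v]:
        tightened = meet(BOUNDS[u](s), s[u])
        if tightened is None:
            return None                    # infeasible: descriptor is bot
        if tightened != s[u]:              # bound on u is violated
            s = propagate(s, u, tightened)
            if s is None:
                return None
    return s
```

Only the variables reachable through `AFFECTS` from a changed variable are revisited, in contrast to the full recomputation performed by simple iteration. Termination in this sketch relies on the integer intervals shrinking strictly at each recursive call.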

Constraint Propagation for Parallel Task Scheduling

For parallel task scheduling, each iteration of the Propagate operation has the following form: For each processor proc let $proc(i)$ be the i-th reservation on proc, and $proc(i).task$ the task for that reservation. Letting $est_i$ denote the current value of earliest-start-time for that task and $est'_i$ the next value of the earliest-start-time for that task (with $lft_i$ and $lft'_i$ defined analogously), we have

\[ est'_i = \max \left\{ \begin{array}{l} est_i \\ est_{i-1} + \textit{actual-duration}_{i-1} \\ \max\{ tsk.est + tsk.\textit{actual-duration} \mid tsk \prec proc(i).task \} \end{array} \right. \qquad (7) \]

\[ lft'_i = \min \left\{ \begin{array}{l} lft_i \\ lft_{i+1} - \textit{actual-duration}_{i+1} \\ \min\{ tsk.lft - tsk.\textit{actual-duration} \mid proc(i).task \prec tsk \} \end{array} \right. \qquad (8) \]

Here $\textit{actual-duration}_{i-1}$ is the time taken to execute the (i-1)-th task on proc, i.e., $\textit{nominal-duration}_{i-1} / proc.speed$. Boundary cases must be handled appropriately.

After adding a new reservation to some processor, the effect of Propagate will be to shrink the (est, lft) window of each task on the same processor, and possibly also of predecessor and successor tasks of the newly added task. (Note that propagation over successors requires that the generated propagation code deal with the unscheduled tasks as well as the partially completed schedule, an example of propagation over heterogeneous data types.) If the size of the time window becomes negative or zero for any task, the partial schedule is necessarily infeasible and it can be pruned.
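Updates (7) and (8) can be sketched as a fixpoint loop over the processor sequences. The Python code below is a simplified illustration, not the synthesized propagation code: the function name and data layout are assumptions, every task mentioned in `preds` is assumed to be already scheduled, and a partial schedule is rejected as soon as some window can no longer hold the task's actual duration (a slightly stronger test than an empty window).

```python
from fractions import Fraction


def propagate_windows(sched, speeds, est, lft, dur, preds=None):
    """Iterate updates (7) and (8) until no window changes.

    sched: proc -> list of task names in execution order
    preds: task -> set of predecessor tasks (optional)
    Returns (est, lft) dictionaries, or None if some window is too small.
    """
    preds = preds or {}
    est, lft = dict(est), dict(lft)
    proc_of = {t: p for p, seq in sched.items() for t in seq}
    # actual duration = nominal duration / speed of the assigned processor
    act = {t: Fraction(dur[t]) / speeds[proc_of[t]] for t in proc_of}
    changed = True
    while changed:
        changed = False
        for seq in sched.values():
            for i, t in enumerate(seq):
                lo = [est[t]]
                if i > 0:                                  # previous task on the processor
                    lo.append(est[seq[i - 1]] + act[seq[i - 1]])
                lo += [est[p] + act[p] for p in preds.get(t, ())]
                hi = [lft[t]]
                if i + 1 < len(seq):                       # next task on the processor
                    hi.append(lft[seq[i + 1]] - act[seq[i + 1]])
                hi += [lft[s] - act[s] for s in proc_of if t in preds.get(s, ())]
                new_est, new_lft = max(lo), min(hi)
                if (new_est, new_lft) != (est[t], lft[t]):
                    est[t], lft[t] = new_est, new_lft
                    changed = True
                if new_est + act[t] > new_lft:             # window cannot hold the task
                    return None
    return est, lft
```

Applied to the heterogeneous example with tasks A, B, D on processor 1 and C on processor 2, the loop raises the earliest start time of B and lowers the latest finish time of A, leaving the other windows unchanged.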
5.4. Scheduling Strategies


As  noted  previously,  we  treated  the  parallel  task  scheduling  problem  as  a  special  case  of  the  ASR 
problem.  There  are  a  variety  of  strategies  for  solving  ASRs,  all  based  on  branch-and-propagate.  In 
each  subsection  below  we  present  Horn-like  constraints  derived  from  the  ASR  capacity  constraint 
and  describe  some  data  structures  and  control  strategies  that  can  effectively  apply  them. 


5.4.1.  Discrete  strategy  —  scheduling  individual  processors 

A  discrete  strategy  assumes  that  the  capacity  bound  on  the  ASR  is  integral  and  that  each  task 
consumes  one  unit  of  capacity.  At  each  branching  point,  an  unscheduled  task  is  selected,  assigned 
to  one  of  the  k  processors,  and  finally  to  some  position  within  the  sequence  of  tasks  previously 
assigned  to  that  processor.  The  main  constraint  that  is  propagated  asserts  that  one  task  must 
finish  before  the  next  one  on  the  same  processor  can  start.  A  data  structure  that  represents  a 
map  from  processors  to  sequences  of  reservations  facilitates  this  strategy.  The  data  structure  forces 
the  satisfaction  of  the  capacity  constraint  by  construction  or,  from  another  point  of  view,  the 
residual  propagation  necessary  to  satisfy  the  capacity  constraint  is  precedence  between  consecutive 
tasks  for  each  processor. 

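The derived constraints are as follows. Let $m$ be a map from $\{1..k\}$ to sequences of tasks. From
the basic constraint (expressed over start times)

\[
\begin{array}{l}
\forall (\mathit{proc} : \mathit{integer},\ \mathit{index} : \mathit{integer}) \\
\quad \bigl(\mathit{proc} \in \{1..k\} \\
\quad\ \wedge\ \mathit{index} \in \{1..\mathit{size}(m(\mathit{proc})) - 1\} \\
\quad\ \Longrightarrow\ m(\mathit{proc})(\mathit{index}).\mathit{ft} \le m(\mathit{proc})(\mathit{index}+1).\mathit{st}\bigr)
\end{array}
\]

we can infer (an indexed collection of) Horn-like propagation constraints over start time windows:

\[
\begin{array}{l}
\forall (\mathit{proc} : \mathit{integer},\ \mathit{index} : \mathit{integer}) \\
\quad \bigl(\mathit{proc} \in \{1..k\} \\
\quad\ \wedge\ \mathit{index} \in \{1..\mathit{size}(m(\mathit{proc})) - 1\} \\
\quad\ \Longrightarrow\ m(\mathit{proc})(\mathit{index}).\mathit{eft} \le m(\mathit{proc})(\mathit{index}+1).\mathit{est} \\
\quad\ \phantom{\Longrightarrow}\ \wedge\ m(\mathit{proc})(\mathit{index}).\mathit{lft} \le m(\mathit{proc})(\mathit{index}+1).\mathit{lst}\bigr)
\end{array}
\]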


The conjunct

\[
m(\mathit{proc})(\mathit{index}).\mathit{eft} \le m(\mathit{proc})(\mathit{index}+1).\mathit{est}
\]

provides a lower bound on $m(proc)(index+1).est$: whenever $m(proc)(index).eft$ increases, the
constraint may become invalid; to reestablish the constraint, the value of $m(proc)(index+1).est$
can then be increased (by the minimal amount) to $m(proc)(index).eft$. Note that $m(proc)(index).eft$
is monotone under refinement, since the earliest start time of a task can only increase.
Correspondingly, the second conjunct provides an upper bound on $m(proc)(index).lft$, which decreases
(monotonically) whenever $m(proc)(index+1).lst$ decreases.
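
The two propagation rules can be sketched in Python as a forward pass (raising est) and a backward pass (lowering lst) along one processor's sequence. This is only an illustrative sketch: the names `Reservation` and `propagate` are hypothetical, each task is assumed to have a fixed duration on its processor so that eft = est + duration and lft = lst + duration, and a window is treated as infeasible when est exceeds lst.

```python
from dataclasses import dataclass


@dataclass
class Reservation:
    est: float       # earliest start time
    lst: float       # latest start time
    duration: float  # assumed fixed once the processor is chosen

    @property
    def eft(self):   # earliest finish time
        return self.est + self.duration

    @property
    def lft(self):   # latest finish time
        return self.lst + self.duration


def propagate(seq):
    """Tighten the windows along one processor's sequence.

    Returns False if some window becomes empty (est > lst).
    """
    # Forward pass: eft(i) <= est(i+1); raise est(i+1) by the minimal amount.
    for prev, nxt in zip(seq, seq[1:]):
        nxt.est = max(nxt.est, prev.eft)
    # Backward pass: lft(i) <= lst(i+1); lower lst(i) by the minimal amount.
    for prev, nxt in zip(reversed(seq[:-1]), reversed(seq[1:])):
        prev.lst = min(prev.lst, nxt.lst - prev.duration)
    return all(r.est <= r.lst for r in seq)
```

Because est values only increase and lst values only decrease, one forward and one backward sweep reach the fixpoint for a single sequence; precedence constraints between processors would need further propagation that this sketch omits.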


(Footnote: This, of course, is where we use the simplifying assumption that the ASR capacity is an integer.)

There are several advantages to the discrete approach. First, unlike the aggregate ASR scheduling
strategy (discussed in Section 5.4.2), the discrete strategy applies to both the homogeneous and
heterogeneous variants of the problem. Second, partial schedules are clear and definite. And third, this
constraint  has  the  property  that  extremal  extractable  schedules  are  always  feasible  (relative  to  the 
tasks  scheduled  so  far).  (We  have  also  used  this  approach  in  scheduling  parking  slots  and  ground 
crews  in  airlift  scheduling  in  the  ITAS  system  [4].) 

The main disadvantage is that the branching factor is very high and grows as $O(kn)$ where $n$ is
the  number  of  allocated  tasks.  Nevertheless,  good  heuristics  for  choosing  a  processor  and  position 
within  the  schedule  for  a  processor  can  result  in  a  fast  heuristic  algorithm  producing  good  results. 
Our  synthesized  algorithm  used  two  heuristics  (incorporated  into  the  global  search  theory  for  this 
strategy)  to  help  keep  the  branching  factor  at  a  computationally  practical  level:  First,  the  next 
task  to  be  scheduled  is  always  appended  to  the  sequence  of  tasks  already  scheduled  on  a  processor 
(instead  of  being  inserted  at  some  point  within  the  sequence).  Second,  the  processors  are  ordered  in 
such  a  way  (namely,  in  decreasing  order  of  processor  speed)  that  (other  things  being  equal)  a  task 
to be scheduled is more likely to be assigned to a fast processor than a slow one; thus the earlier
branches  of  the  search  tree  tend  to  represent  partial  schedules  in  which  the  fastest  processors  run 
the  most  tasks. 
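
The two heuristics can be sketched as a branching function that only appends and that visits processors in decreasing order of speed. The function name `branches` and the dictionary representation of the processor-to-sequence map are assumptions made for this illustration, not the synthesized code.

```python
def branches(schedule, speeds, task):
    """Yield (processor, child_schedule) pairs, fastest processor first.

    schedule maps each processor to its list of tasks; the new task is
    always appended, never inserted in the middle of a sequence.
    """
    order = sorted(schedule, key=lambda p: speeds[p], reverse=True)
    for p in order:
        # Copy the map so sibling branches do not share state.
        child = {q: list(seq) for q, seq in schedule.items()}
        child[p].append(task)
        yield p, child
```

With append-only placement each node has k children rather than one child per (processor, position) pair.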


5.4.2.  Aggregate  strategies 

In  aggregate  strategies  we  treat  the  scheduling  of  an  ASR  as  a  whole,  assuming  no  internal  structure 
to  the  capacity  of  an  ASR.  In  the  context  of  the  problem  under  consideration,  this  means  that  we 
do  not  assign  a  task  to  a  specific  processor  when  it  is  scheduled;  thus  an  aggregate  strategy  can  only 
be  used  for  the  homogeneous  variant  of  the  parallel  task  scheduling  problem.  At  each  branching 
point,  an  unscheduled  task  is  selected,  and  it  is  added  to  the  partial  schedule  constructed  so  far. 
Its  addition  may  trigger  various  constraints,  in  particular  a  disjunctive  constraint  which  effectively 
does  the  branching.  Data  structures  that  represent  possible  and  definite  demand  maps  facilitate 
this  strategy.  More  details  on  aggregate  strategies  may  be  found  in  [26]. 


5.5.  Experience  with  generated  code 

We  used  KIDS  [24]  to  synthesize  a  variety  of  ASR  scheduling  algorithms,  in  particular  two  discrete 
algorithms called Discrete-1 and Discrete-2, as described in Section 5.4.1. They differ primarily
in  the  variants  of  problems  that  they  solve:  Discrete-1  deals  with  the  homogeneous  case  with  no 
precedence  relationships,  Discrete-2  allows  heterogeneous  processors  and  precedence  relationships 
between  tasks.  Two  aggregate  algorithms,  called  Aggregate-1  and  Aggregate-2  were  also  synthesized. 
The  aggregate  algorithms  differ  in  their  strategies  for  splitting  and  extracting. 


5.6.  Timing  experiments 

We  ran  two  sets  of  experiments  with  the  generated  code.  In  the  first  set  we  generated  a  number 
of  random  problems  of  the  homogeneous  variant  with  no  precedence  relationships  between  tasks, 
and  compared  the  performance  of  the  generated  algorithms  which  solved  that  case  only,  namely 
Aggregate-1,  Aggregate-2,  and  Discrete-1.  In  the  second  set  we  determined  the  time  complexity  of 
algorithm Discrete-2 on heterogeneous problems with precedence constraints, using the number of
tasks as the problem size.

Figure 6: Number of failures (homogeneous processors); n = 100, capacity = 10, 20 trials

Generating random problems - homogeneous processors, no inter-task precedence relations

There  are  several  parameters  to  be  varied  in  generating  random  ASR  problems  in  this  case: 

•  c  —  ASR  capacity  (10  to  20) 

•  n  —  number  of  tasks  (from  50  to  1000) 

•  max task duration: task durations are uniformly distributed over the range [1 .. max-task-duration]

•  max  window  —  the  width  of  the  est-lst  time  window  is  uniformly  distributed  over  the  range 
[1  ..  max-window] 

•  demand  density  —  varies  over  [0,1] 


The task generation strategy was this: generate n tasks of random duration (over the range [1 ..
max-task-duration]), then sum the durations to get an aggregate demand ad. We know that

\[
ad = c \times td \times dd
\]

so given c, ad, and dd, we calculate the total duration td and then randomly generate earliest start
times uniformly over the range [1 .. td - tsk.duration] for each task tsk. For example, if c = 20,
the demand density is dd = 0.5, and the aggregate demand is ad = 1000, then the total duration is
td = ad / (c × dd) = 100.
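
The generation strategy can be sketched as follows. The function names, the dictionary layout of a task, and the guard that keeps the start-time range non-empty are assumptions of this sketch; the original generator is not shown in the report.

```python
import random


def total_duration(ad, c, dd):
    """Solve ad = c * td * dd for the total duration td."""
    return ad / (c * dd)


def generate_tasks(n, c, dd, max_task_duration, max_window, seed=None):
    """Generate n random tasks for capacity c and demand density dd (dd > 0)."""
    rng = random.Random(seed)
    durations = [rng.randint(1, max_task_duration) for _ in range(n)]
    td = total_duration(sum(durations), c, dd)
    tasks = []
    for d in durations:
        # Earliest start uniform over [1 .. td - duration]; the max() guard
        # (an added assumption) keeps the range non-empty for tiny td.
        est = rng.randint(1, max(1, int(td) - d))
        # Width of the est-lst window uniform over [1 .. max_window].
        lst = est + rng.randint(1, max_window)
        tasks.append({"duration": d, "est": est, "lst": lst})
    return td, tasks
```

For the example in the text, `total_duration(1000, 20, 0.5)` gives 100.0.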

Experimental  results  -  algorithms  Aggregate-1,  Aggregate-2,  Discrete-1 
Figure 7: Average case run-time excluding time-outs (homogeneous processors); n = 100, c = 10, 20 trials


The  main  variable  we  explored  was  the  average  complexity  of  each  scheduling  algorithm  as  a 
function  of  demand  density.  The  demand  density  is  the  ratio  of  aggregate  task  demand  to  aggregate 
ASR  capacity  (capacity  times  total  duration)  which  we  varied  from  0  to  1. 

As expected, the data show that the hardest problem instances occur at an intermediate demand
density (empirically about 0.8). We wanted to show the performance of the algorithms on reasonably
hard task sets, but that entails that some are so hard that they take too long to solve. We therefore
introduced a cutoff time of 30 seconds. Figure 6 shows the percentage of task sets that were
unsolvable within 30 seconds as a function of demand density. Figure 7 compares the average run-times
for the three machine-synthesized ASR algorithms, where we excluded from the sample the task sets that
timed out.

Generating random problems: heterogeneous processors, inter-task precedence relations

In  generating  random  problems  for  this  variant,  additional  parameters  can  be  varied: 


•  processor  speeds,  for  each  of  k  processors 

•  the  partial  order  on  T  which  expresses  the  precedence  relationship  between  tasks 


A  random  precedence  relationship  between  tasks  was  generated  by  adding  predecessors  or  successors 
to  tasks  selected  at  random  from  the  task  set  of  a  random  non-precedence  problem  that  was 
generated  as  described  above.  (The  probability  that  an  arbitrary  task  in  the  original  non-precedence 
problem  will  be  selected  is  referred  to  as  the  precedence  density.)  Note  that  the  complete  partial 
ordering  does  not  need  to  be  reified  in  the  problem  structure;  it  suffices  to  express  the  covering 
relationships that were added by the precedence generation process, since the transitive closure of
the covering relationship is (in effect) computed by the code that propagates the predecessor and
successor constraints.

Figure 8: Average case run-time by number of tasks (heterogeneous processors with inter-task
precedence relations)

Experimental  results  -  algorithm  Discrete-2 

For  this  set  of  tests  we  varied  the  number  of  tasks,  keeping  all  other  variables  constant.  Figure 
8  shows  the  rate  of  increase  of  time  as  problem  size  (expressed  as  n,  the  number  of  tasks  in  T) 
increases. 

For  these  runs  the  other  parameters  had  constant  values,  as  follows: 

•  k  (number  of  processors):  5 

•  speed_i (processor speeds): (2.5, 2.0, 1.5, 1.0, 0.75)

•  max  task  duration:  10 

•  max  window  :  40 

•  precedence  density:  0.25 

•  demand  density:  0.3 

(The randomly generated task duration and task window variables are uniformly distributed over
the ranges [1 .. max-task-duration] and [1 .. max-window], respectively.)

Curve  fitting  of  the  data  in  Figure  8  shows  that  it  has  an  O(n^)  rate  of  growth  (the  graph  shows 
the  data  points  and  the  curve  y  =  0.0000022  *  +  2). 
5.7. Concluding Remarks

Our overall experience with these algorithms is that they will either find a schedule quickly or else
take a very long time to complete. Finding a schedule quickly means that little or no backing up
occurs during search: for the most part, only descendants and siblings are explored.

From  this  experience  we  conclude  that  the  most  practical  way  to  exploit  these  algorithms  is  to 
modify  them  to  be  nonbacktracking  heuristic  relaxation  algorithms.  That  is,  we  modify  the  control 
structure  to  minimize  backtracking,  and  in  the  case  where  no  descendant  of  a  search  tree  node 
survives  propagation,  then  we  relax  some  constraint  (typically  the  latest  start  time)  on  a  task  to 
the  minimum  degree  necessary  to  allow  the  algorithm  to  proceed.  If  the  task  in  question  has  a 
hard  real-time  constraint,  then  backtracking  is  necessary.  This  way  we  build  on  the  strong,  efficient 
constraint  propagation  techniques  and  simple  control  strategy,  but  obtain  an  efficient  algorithm 
(usually with low-order polynomial time complexity) that will produce feasible schedules for a
slightly relaxed problem instance.

The  data  presented  above  does  not,  of  course,  provide  a  complete  evaluation  of  the  effectiveness 
of  the  algorithms.  It  would  be  desirable  to  assess  the  tradeoff  between  completeness,  quality 
of  schedule,  and  runtime,  as  well  as  study  the  effect  of  varying  problem  parameters  in  addition 
to  number  of  tasks  and  demand  density,  such  as  window  size  relative  to  task  size,  precedence 
density,  number  of  processors,  and  (for  the  heterogeneous  case)  distribution  of  total  capacity  among 
processors. 


6.  Limited  Discrepancy  Search 

6.1.  Description  and  Analysis  of  LDS 

Limited  Discrepancy  Search  (LDS)  is  a  variant  of  global  search  for  traversing  a  search  tree  in 
an  order  that  is  likely  to  find  solutions  earlier  in  the  traversal  than  existing  strategies  such  as 
chronological  backtracking.  In  chronological  backtracking,  when  a  failure  is  encountered,  the  most- 
recent  choice  is  undone  and  an  alternative  choice  made;  if  that  fails  then  the  next-most-recent 
choice  is  revisited  and  so  on.  Thus  alternatives  to  choices  deep  in  the  tree  are  explored  first,  before 
those  near  the  root.  However,  for  many  problems  the  heuristics  for  making  choices  improve  in 
their accuracy deeper in the tree where the solution space is more constrained. Thus the choices
near  the  root  of  the  tree  are  more  likely  to  be  wrong  than  those  deep  in  the  tree.  This  is  not  so 
important  if  the  entire  tree  can  be  explored  exhaustively,  but  with  reasonable-sized  problems  this 
is  rarely  practical. 

The  goal  of  the  LDS  strategy  is  to  explore  paths  in  the  choice  tree  in  the  order  of  likelihood  that 
they  will  succeed.  The  LDS  strategy  first  takes  the  path  through  the  tree  given  by  the  best  local 
choice  at  each  choice-point  (as  does  chronological  backtracking).  Next  it  considers  the  paths  of 
discrepancy  one:  these  consist  of  taking  the  best  local  choice  at  each  choice-point  except  for  one 
choice-point  where  the  second-best  choice  is  made.  If  the  tree  is  of  depth  n  then  there  are  n 
such paths. Next, paths of discrepancy two are considered, then discrepancy three, and so on. A path
of discrepancy two is one where the best choice is taken except at two choice-points where the
second-best choice is taken. Harvey and Ginsberg only considered trees with binary choices. For
an n-ary tree one could also classify a path as being of discrepancy two if it consisted of the best
choice everywhere except at one choice-point where the third-best choice is taken.
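
The visiting order for a binary choice tree can be sketched in Python. The function name `lds_paths` and the encoding of a path as a tuple of 0 (best choice) and 1 (second-best choice) are assumptions of this sketch; the order in which paths of equal discrepancy are visited is arbitrary here and is not claimed to match Harvey and Ginsberg's.

```python
from itertools import combinations


def lds_paths(depth, max_discrepancy):
    """Yield root-to-leaf paths in order of increasing discrepancy.

    A path is a tuple with one entry per choice-point:
    0 = take the best local choice, 1 = take the second-best choice.
    """
    for k in range(max_discrepancy + 1):
        # Choose the k choice-points at which the heuristic is overridden.
        for positions in combinations(range(depth), k):
            path = [0] * depth
            for p in positions:
                path[p] = 1
            yield tuple(path)
```

For a tree of depth 3, `lds_paths(3, 1)` yields the all-best path followed by the three paths of discrepancy one, matching the count of n such paths given above.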
There are several options for parallelizing the LDS scheme that trade off communication cost against
the cost of redundant computation. To avoid redundant computation one spawns a new process
whenever a second-best choice is taken. Thus, there is one process initially which always
takes the best choice at each choice-point, but always spawns a new process with the second-best
choice at that point. Thereafter, the new processes always take the best choice. In spawning the
new process the complete state of the scheduler must be communicated to the new process, but
thereafter the processes are independent, except that when they are finished they need to communicate
to find which has the best solution. The state of the scheduler can become rather large, so the
initialization cost may be significant. Another problem with this parallelization is that initially there is
only one process and the number of processes grows linearly in the depth of the tree, so early on
many processors may be idle.

An alternative implementation with minimal communication and idle time is to start with n processes,
each of which explores the tree from its root, with process i taking the second-best choice
at choice-point i - 1 and the best choice everywhere else. There is one process that always takes
the best choice, so for the others, process i is doing redundant work up to choice-point i - 1. If each
process takes constant time at each level then up to half of the computation done is redundant. If
the computation at level i is O(i) then up to one third of the computation is redundant.
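
The two fractions can be checked with a short calculation. The sketch below assumes, as a simplification, that process $i$ repeats exactly the first $i-1$ levels of the all-best path, that every process descends through all $n$ levels, and that level $j$ costs either $1$ or $j$ units.

```latex
% Constant cost per level: redundant work over total work
\[
\frac{\sum_{i=1}^{n} (i-1)}{n \cdot n}
  = \frac{n(n-1)/2}{n^{2}}
  = \frac{n-1}{2n}
  \;\longrightarrow\; \frac{1}{2}
\]

% Cost j at level j: redundant work over total work
\[
\frac{\sum_{i=1}^{n} \sum_{j=1}^{i-1} j}{n \sum_{j=1}^{n} j}
  = \frac{(n-1)\,n\,(n+1)/6}{n^{2}(n+1)/2}
  = \frac{n-1}{3n}
  \;\longrightarrow\; \frac{1}{3}
\]
```

Both ratios approach their limits from below, which is consistent with the "up to half" and "up to one third" wording.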

The  above  analysis  assumes  one  processor  per  process.  In  practice  there  are  likely  to  be  significantly 
fewer  processors  than  the  depth  of  the  tree.  If  we  have  m  processors  then  we  can  give  each  processor 
the  task  of  computing  n/m  paths.  In  this  case  the  redundancy  is  reduced  by  approximately  the 
same  factor. 


6.2.  Experiments  with  LDS 

We  did  experiments  to  test  the  effectiveness  of  LDS  using  randomly  generated  scheduling  problems. 
We  chose  a  problem  where  we  had  previous  experimental  experience,  that  of  Asynchronously  Shared 
Resources  (ASRs).  An  ASR  is  a  resource  shared  simultaneously  by  many  tasks,  for  example  an 
automobile  parking  lot  having  n  parking  slots. 

The  main  properties  we  explored  were  those  that  varied  as  a  function  of  demand  density.  The 
demand  density  is  the  ratio  of  aggregate  task  demand  to  aggregate  capacity  (capacity  times  total 
duration)  which  we  varied  from  0  to  1.  The  greater  the  demand  density  the  harder  it  is  to  find  a 
viable  schedule. 

There are several parameters used in generating random ASR problems:

•  c: ASR capacity (20)

•  n: number of tasks (100)

•  dmax: task durations are uniformly distributed over the range [1 .. 100]

•  wmax: the width of the earliest to latest start time window is uniformly distributed over the
range [1 .. 100]

•  demand density: varies over [0,1]

In  Figure  9  we  show  the  experimental  results  for  how  often  the  scheduler  was  able  to  find  a  schedule 
for  the  original  method  using  chronological  backtracking  and  the  LDS  method  with  a  discrepancy 
of  one.  These  results  clearly  show  that  LDS  is  effective  in  finding  significantly  more  solutions  when 
the  scheduling  problems  become  difficult. 
Figure 9: Scheduling Success Rate of Chronological Backtracking vs. LDS


7.  Parallel  Sorting  Algorithms 


In  this  section  we  apply  these  ideas  to  the  design  of  Batcher’s  Odd-Even  sort  [1]  and  discuss 
the derivation of several other well-known parallel sorting algorithms. Most, if not all, sorting
algorithms  can  be  derived  as  interpretations  of  the  divide-and-conquer  paradigm.  Accordingly,  we 
present  a  simplified  divide-and-conquer  theory  and  show  how  it  can  be  applied  to  design  the  sort 
algorithms  mentioned  above. 


7.1.  Derivation  of  a  Mergesort 

7.1.1.  Domain  Theory  for  Sorting 

Suppose that we wish to sort a collection of objects belonging to some set $\alpha$ that is linearly ordered
under $\le$. Here is a simple specification of the sorting problem:

    Sort(x : bag(α) | true)
      returns( z : seq(α) | x = Seq-to-bag(z) ∧ Ordered(z) )

Sort takes a bag (multiset) $x$ of $\alpha$ objects and returns some sequence $z$ such that the following
output condition holds: the bag of objects in sequence $z$ is the same as $x$ and $z$ must be ordered
under $\le$. The predicate true following the parameter $x$ is called the input condition and specifies
any constraints on inputs.

In order to support this specification formally, we need a domain theory of sorting that includes the
theory of sequences and bags, has the linear order $(\alpha, \le)$ as a parameter, and defines the concepts
of Seq-to-bag and Ordered. The following parameterized theory accomplishes these ends:

    Theory Sorting((α, ≤) : linear-order)
      Imports integer, bag(α), seq(α)
      Operations
        Ordered : seq(α) → Boolean
      Axioms
        ∀(S : seq(α)) (Ordered(S)
                        ⟺ ∀(i)(i ∈ {1..length(S) - 1} ⟹ S(i) ≤ S(i + 1)))
      Theorems
        Ordered([]) = true
        ∀(a : α) (Ordered([a]) = true)
        ∀(y1 : seq(α), y2 : seq(α))
          (Ordered(y1 ++ y2) ⟺ Ordered(y1)
                                 ∧ Seq-to-bag(y1) ≤ Seq-to-bag(y2)
                                 ∧ Ordered(y2))
    end-theory
Sorting theory imports integer, bag, and sequence theory. Sequences are constructed via [] (empty
sequence), [a] (singleton sequence), and A ++ B (concatenation). For example,

    [1,2,3] ++ [4,5,6] = [1,2,3,4,5,6].

Several parallel sorting algorithms are based on an alternative set of constructors which use
interleaving in place of concatenation: the ilv operator

    [1,2,3] ilv [4,5,6] = [1,4,2,5,3,6]

interleaves the elements of its arguments. We assume that the arguments to ilv have the same
length, typically denoted n, and that it is defined by

\[
A \;\textit{ilv}\; B = C \;\Longleftrightarrow\; \forall (i)\,\bigl(i \in \{1..n\} \Longrightarrow C_{2i-1} = A_i \,\wedge\, C_{2i} = B_i\bigr).
\]

In  Section  7.2.  we  develop  some  of  the  theory  of  sequences  based  on  the  ilv  constructor. 
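
As an illustration, the ilv constructor and its inverse can be sketched in Python with 0-based lists standing in for 1-based sequences. The names `ilv` and `unilv` are chosen for this sketch only.

```python
def ilv(a, b):
    """Interleave two equal-length sequences: a1, b1, a2, b2, ..."""
    if len(a) != len(b):
        raise ValueError("ilv expects sequences of equal length")
    out = []
    for x, y in zip(a, b):
        out.extend((x, y))
    return out


def unilv(c):
    """Split an even-length sequence into its odd- and even-position parts."""
    return c[0::2], c[1::2]
```

Here `ilv([1, 2, 3], [4, 5, 6])` reproduces the example above, and `unilv` recovers the two arguments.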

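Bags have an analogous set of constructors: $\{\!\{\,\}\!\}$ (empty bag), $\{\!\{a\}\!\}$ (singleton bag), and $A \uplus B$
(associative and commutative bag union). The operator Seq-to-bag coerces sequences to bags by
forgetting the ordering implicit in the sequence. Seq-to-bag obeys the following distributive laws:

\[
\begin{array}{l}
\textit{Seq-to-bag}([\,]) = \{\!\{\,\}\!\} \\[2pt]
\forall (a : \alpha)\ \ \textit{Seq-to-bag}([a]) = \{\!\{a\}\!\} \\[2pt]
\forall (y_1 : seq(\alpha),\ y_2 : seq(\alpha)) \\
\quad \textit{Seq-to-bag}(y_1 \mathbin{+\!\!+} y_2) = \textit{Seq-to-bag}(y_1) \uplus \textit{Seq-to-bag}(y_2) \\[2pt]
\forall (y_1 : seq(\alpha),\ y_2 : seq(\alpha)) \\
\quad \textit{Seq-to-bag}(y_1 \;\textit{ilv}\; y_2) = \textit{Seq-to-bag}(y_1) \uplus \textit{Seq-to-bag}(y_2)
\end{array}
\]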

In  the  sequel  we  will  omit  universal  quantifiers  whenever  it  is  possible  to  simplify  the  presentation 
without  sacrificing  clarity. 
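
The distributive laws for Seq-to-bag can be exercised on concrete data. In this sketch a bag is modeled by Python's `collections.Counter`, whose `+` adds multiplicities and so plays the role of bag union; the helper names are assumptions of the sketch.

```python
from collections import Counter


def seq_to_bag(seq):
    """Coerce a sequence to a bag by forgetting the order of its elements."""
    return Counter(seq)


def ilv(a, b):
    """Interleave two equal-length sequences."""
    return [x for pair in zip(a, b) for x in pair]


y1, y2 = [3, 1, 3], [2, 3, 5]
# Law for concatenation and law for interleaving.
concat_law = seq_to_bag(y1 + y2) == seq_to_bag(y1) + seq_to_bag(y2)
ilv_law = seq_to_bag(ilv(y1, y2)) == seq_to_bag(y1) + seq_to_bag(y2)
```

A check on one pair of sequences is of course not a proof; it only illustrates what the laws assert.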


7.1.2.  Divide-and-Conquer  Theory 

Most sorting algorithms are based on the divide-and-conquer paradigm: If the input is primitive
then a solution is obtained directly, by simple code. Otherwise a solution is obtained by decomposing
the input into parts, independently solving the parts, then composing the results. Program
termination is guaranteed by requiring that decomposition is monotonic with respect to a suitable
well-founded ordering. In this paper we focus on divide-and-conquer algorithms that have the
following general form:

    DC(x0 : D | I(x0))
      returns( z : R | O(x0, z) )
      = if Primitive(x0)
        then Directly-Solve(x0)
        else let (x1, x2) = Decompose(x0)
             Compose(DC(x1), DC(x2))

We refer to Decompose as a decomposition operator, Compose as a composition operator, Primitive
as a control predicate, and Directly-Solve as a primitive operator.

The essence of a divide-and-conquer algorithm can be presented via a reduction diagram:

                   DC
        x0 ----------------> z0
        |                    ^
        | Decompose          | Compose
        v                    |
    (x1, x2) ------------> (z1, z2)
               DC × DC

which should be read as follows. Given input $x_0$, an acceptable solution $z_0$ can be found by
decomposing $x_0$ into two subproblems $x_1$ and $x_2$, solving these subproblems recursively, yielding
solutions $z_1$ and $z_2$ respectively, and then composing $z_1$ and $z_2$ to form $z_0$.

In the derivations of this paper we will usually ignore the Primitive predicate and Directly-Solve
operator; the interesting design work lies in calculating compatible pairs of Decompose and Compose
operators.

The following mergesort program is an instance of this scheme:

    MSort(b0 : bag(integer))
      returns( z : seq(integer) | b0 = Seq-to-bag(z) ∧ Ordered(z) )
      = if size(b0) ≤ 1
        then b0
        else let (b1, b2) = Split(b0)
             Merge(MSort(b1), MSort(b2))


Here  Split  decomposes  a  bag  into  two  subbags  of  roughly  equal  size  and  Merge  composes  two 
sorted  sequences  to  form  a  sorted  sequence. 
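
The scheme and its mergesort instance can be sketched in Python as a higher-order function. This is an illustration under stated assumptions rather than the synthesized code: a bag is represented by a Python list, `split` halves that list, and `merge` is the usual sequential merge.

```python
def divide_and_conquer(primitive, directly_solve, decompose, compose):
    """Build a solver from the four operators of the scheme."""
    def dc(x0):
        if primitive(x0):
            return directly_solve(x0)
        x1, x2 = decompose(x0)
        return compose(dc(x1), dc(x2))
    return dc


def split(b):
    """Decompose a bag (here a list) into two parts of roughly equal size."""
    mid = len(b) // 2
    return b[:mid], b[mid:]


def merge(a, b):
    """Compose two sorted sequences into one sorted sequence."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    return out + a[i:] + b[j:]


# Mergesort as an instance: Primitive is size <= 1, Directly-Solve is a copy.
msort = divide_and_conquer(lambda b: len(b) <= 1, list, split, merge)
```

The two recursive calls inside `dc` are independent of each other, which is the property that makes the scheme attractive for parallel execution.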

The  characteristic  that  subproblems  are  solved  independently  gives  the  divide-and-conquer  notion 
its  great  potential  in  parallel  environments.  Another  aspect  of  divide-and-conquer  is  that  the 
recursive  decomposition  can  often  be  performed  implicitly,  thereby  enabling  a  purely  bottom-up 
computation.  For  example,  in  the  Mergesort  algorithm,  the  only  reason  for  the  recursive  splitting  is 
to  control  the  order  of  composition  (merging)  of  sorted  subproblem  solutions.  However  the  pattern 
of  merging  is  easily  determined  at  design-time  and  leads  to  the  usual  binary  tree  computation 
pattern. 

To express the essence of divide-and-conquer, we define a divide-and-conquer theory comprised of
various sorts, functions, predicates, and axioms that assure that the above scheme correctly solves a
given problem. A simplified divide-and-conquer theory is as follows (for more details see [18, 21]):

    Theory Divide-and-Conquer
      Sorts D, R                                  domain and range of a problem
      Operations
        I : D → Boolean                           input condition
        O : D × R → Boolean                       output condition
        Primitive : D → Boolean                   control predicate
        O_Decompose : D × D × D → Boolean         output condition for Decompose
        O_Compose : R × R × R → Boolean           output condition for Compose
        ≻ : D × D → Boolean                       well-founded order
      Soundness Axiom
        O_Decompose(x0, x1, x2)
        ∧ O(x1, z1) ∧ O(x2, z2)
        ∧ O_Compose(z0, z1, z2)
        ⟹ O(x0, z0)
    end-theory

The intuitive meaning of the Soundness Axiom is that if input $x_0$ decomposes into a pair of
subproblems $(x_1, x_2)$, and $z_1$ and $z_2$ are solutions to subproblems $x_1$ and $x_2$ respectively, and furthermore
solutions $z_1$ and $z_2$ can be composed to form solution $z_0$, then $z_0$ is guaranteed to be a solution
to input $x_0$. There are other axioms that are required: well-foundedness conditions on $\succ$ and
admissibility conditions that assure that Decompose and Compose can be refined to total functions
over their domains. We ignore these in order to concentrate on the essentials of the design process.

The main difficulty in designing an instance of the divide-and-conquer scheme for a particular
problem lies in constructing decomposition and composition operators that work together. The
following is a simplified version of a tactic in [18].

1.  Choose a simple decomposition operator and well-founded order.

2.  Derive the control predicate based on the conditions under which the decomposition operator
preserves the well-founded order and produces legal subproblems.

3.  Derive the input and output conditions of the composition operator using the Soundness
Axiom of divide-and-conquer theory.

4.  Design an algorithm for the composition operator.

5.  Design an algorithm for the primitive operator.

Mergesort is derived by choosing $\uplus$ as a simple (nondeterministic) decomposition operator. A
specification for the well-known merge operation is derived using the Soundness Axiom.

                    Sort
         b0 ----------------> z0
         |                    ^
       ⊎ |                    | Merge
         v                    |
     <b1, b2> ------------> <z1, z2>
               Sort × Sort

A similar tactic based on choosing a simple composition operator and then solving for the
decomposition operator is also presented in [18]. This tactic can be used to derive selection sort and
quicksort-like algorithms.
Deriving the output condition of the composition operator is the most challenging step and bears
further explanation. The Soundness Axiom of divide-and-conquer theory relates the output
conditions of the subalgorithms to the output condition of the whole divide-and-conquer algorithm:

\[
\begin{array}{l}
O_{Decompose}(x_0, x_1, x_2) \\
\quad \wedge\ O(x_1, z_1) \,\wedge\, O(x_2, z_2) \\
\quad \wedge\ O_{Compose}(z_0, z_1, z_2) \\
\Longrightarrow\ O(x_0, z_0)
\end{array}
\]

For design purposes this constraint can be treated as having three unknowns: $O$, $O_{Decompose}$, and
$O_{Compose}$. Given $O$ from the original specification, we supply an expression for $O_{Decompose}$, then
reason backwards from the consequent to an expression over the program variables $z_0$, $z_1$, and $z_2$.
This derived expression is taken as the output condition of Compose.

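Returning to Mergesort, suppose that we choose $\uplus$ as a simple decomposition operator. To
proceed with the tactic, we instantiate the Soundness Axiom with the following substitutions

\[
\begin{array}{lcl}
O_{Decompose} & \mapsto & \lambda(b_0, b_1, b_2)\ \ b_0 = b_1 \uplus b_2 \\
O & \mapsto & \lambda(b, z)\ \ b = \textit{Seq-to-bag}(z) \wedge \textit{Ordered}(z)
\end{array}
\]

yielding

\[
\begin{array}{l}
b_0 = b_1 \uplus b_2 \\
\quad \wedge\ b_1 = \textit{Seq-to-bag}(z_1) \wedge \textit{Ordered}(z_1) \\
\quad \wedge\ b_2 = \textit{Seq-to-bag}(z_2) \wedge \textit{Ordered}(z_2) \\
\quad \wedge\ O_{Compose}(z_0, z_1, z_2) \\
\Longrightarrow\ b_0 = \textit{Seq-to-bag}(z_0) \wedge \textit{Ordered}(z_0)
\end{array}
\]

To derive $O_{Compose}(z_0, z_1, z_2)$ we reason backwards from the consequent $b_0 = \textit{Seq-to-bag}(z_0) \wedge \textit{Ordered}(z_0)$
toward a sufficient condition expressed over the variables $\{z_0, z_1, z_2\}$ modulo the
assumptions of the antecedent:

\[
\begin{array}{ll}
 & b_0 = \textit{Seq-to-bag}(z_0) \wedge \textit{Ordered}(z_0) \\[2pt]
\Longleftrightarrow & \mbox{using assumption } b_0 = b_1 \uplus b_2 \\[2pt]
 & b_1 \uplus b_2 = \textit{Seq-to-bag}(z_0) \wedge \textit{Ordered}(z_0) \\[2pt]
\Longleftrightarrow & \mbox{using assumption } b_i = \textit{Seq-to-bag}(z_i),\ i = 1, 2 \\[2pt]
 & \textit{Seq-to-bag}(z_1) \uplus \textit{Seq-to-bag}(z_2) = \textit{Seq-to-bag}(z_0) \wedge \textit{Ordered}(z_0).
\end{array}
\]

This last expression is a sufficient condition expressed in terms of the variables $\{z_0, z_1, z_2\}$ and so
we take it to be the output condition for Compose. In other words, we ensure that the Soundness
Axiom holds by taking this expression as a constraint on the behavior of the composition operator.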
The input condition to the composition operator is obtained by forward inference from the
antecedent of the Soundness Axiom; here we have the (trivial) consequences $\textit{Ordered}(z_1)$ and $\textit{Ordered}(z_2)$.
Only consequences expressed in terms of the input variables $z_1$ and $z_2$ are useful.

Thus we have derived a formal specification for Compose:

    Merge(A : seq(integer), B : seq(integer) | Ordered(A) ∧ Ordered(B))
      returns( z : seq(integer)
             | Seq-to-bag(A) ⊎ Seq-to-bag(B) = Seq-to-bag(z)
               ∧ Ordered(z) ).


Merge  is  now  a  derived  concept  in  Sorting  theory.  We  later  derive  laws  for  it,  but  now  we  proceed 
to  design  an  algorithm  to  satisfy  this  specification.  The  usual  sequential  algorithm  for  merging  is 
based  on  choosing  a  simple  “cons”  composition  operator  and  deriving  a  decomposition  operator 
[18]. However, this algorithm is inherently sequential and requires linear time.
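
The derived specification for Merge can be read as an executable check on a candidate result. In the sketch below the function names are hypothetical, bags are modeled with `collections.Counter`, and a violated input condition is reported by raising an exception, which is a design choice of this sketch only.

```python
from collections import Counter


def ordered(s):
    """Ordered(s): each element is <= its successor."""
    return all(s[i] <= s[i + 1] for i in range(len(s) - 1))


def satisfies_merge_spec(a, b, z):
    """Check the output condition of Merge for inputs a, b and result z."""
    if not (ordered(a) and ordered(b)):
        raise ValueError("input condition Ordered(A) and Ordered(B) violated")
    same_bag = Counter(a) + Counter(b) == Counter(z)
    return same_bag and ordered(z)
```

Any merge algorithm, sequential or parallel, is acceptable exactly when its results pass this check on all legal inputs.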


7.2.  Batcher’s  Odd-Even  Sort 

Batcher’s Odd-Even sort algorithm [1] is a mergesort algorithm in which the merge operator itself
is a divide-and-conquer algorithm. The Odd-Even merge is derived by choosing a simple
decomposition operator based on ilv and deriving constraints on the composition operator.

Before  proceeding  with  algorithm  design  we  need  to  develop  some  of  the  theory  of  sequences  based 
on  the  ilv  constructor.  Generally,  we  develop  a  domain  theory  by  deriving  laws  about  the  various 
concepts  of  the  domain.  In  particular  we  have  found  that  distributive,  monotonicity,  and  invariance 
laws  provide  most  of  the  laws  needed  to  support  formal  design.  This  suggests  that  we  develop  laws 
for various sorting concepts, such as Seq-to-bag and Ordered. From Section 7.1 we have


Theorem 1. Distributing Seq-to-bag over sequence constructors.

1.1. $\text{Seq-to-bag}([\,]) = \lbag\,\rbag$ (the empty bag)

1.2. $\text{Seq-to-bag}([a]) = \lbag a \rbag$

1.3. $\text{Seq-to-bag}(S_1 \text{ ilv } S_2) = \text{Seq-to-bag}(S_1) \uplus \text{Seq-to-bag}(S_2)$
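To make the ilv constructor concrete, here is a small Python sketch. It assumes equal-length arguments and the convention that the first argument supplies the odd positions (counting from 1), which is how ilv is used in the derivations of this section; the function names `ilv` and `unilv` are ours.

```python
from collections import Counter

def ilv(a, b):
    """Interleave two equal-length sequences: [a1, b1, a2, b2, ...]."""
    assert len(a) == len(b), "this sketch assumes equal lengths"
    return [x for pair in zip(a, b) for x in pair]

def unilv(s):
    """Inverse of ilv on even-length input: (odd positions, even positions), 1-based."""
    assert len(s) % 2 == 0
    return list(s[0::2]), list(s[1::2])

def theorem_1_3_holds(a, b):
    """Seq-to-bag distributes over ilv: the interleaving has the union of the two bags."""
    return Counter(ilv(a, b)) == Counter(a) + Counter(b)
```

For instance, `ilv([1, 3], [2, 4])` yields `[1, 2, 3, 4]`, and `unilv` recovers the two arguments.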

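It is not obvious how to distribute Ordered over ilv, so we try to derive it. In this derivation let
$n$ denote the length of both $A$ and $B$.


$\text{Ordered}(A \text{ ilv } B)$

$\iff$ by definition of Ordered

$\forall (i)\,(i \in \{1..2n-1\} \implies (A \text{ ilv } B)_i \le (A \text{ ilv } B)_{i+1})$

$\iff$ change of index

$\forall (j)\,(j \in \{1..n\} \implies (A \text{ ilv } B)_{2j-1} \le (A \text{ ilv } B)_{2j})$
$\land\ \forall (j)\,(j \in \{1..n-1\} \implies (A \text{ ilv } B)_{2j} \le (A \text{ ilv } B)_{2j+1})$

$\iff$ by definition of ilv

$\forall (j)\,(j \in \{1..n\} \implies A_j \le B_j)$
$\land\ \forall (j)\,(j \in \{1..n-1\} \implies B_j \le A_{j+1})$.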


These last two conjuncts are similar in form and suggest the need for a new concept definition and
perhaps new notation. Suppose we define $A \le^* B$ iff $A_i \le B_i$ for $i \in \{1 \ldots n\}$. This allows us to
express the first conjunct as $A \le^* B$, but then we cannot quite express the second conjunct; we
need to generalize to allow an offset in the comparison:

Definition 1. A pair of sequences $A$ and $B$ of length $n$ are pairwise-ordered with offset $k$, written
$A \le^*_k B$, iff $A_i \le B_{i+k}$ for $i \in \{1 \ldots n-k\}$.

Then  the  derivation  above  yields  the  following  simple  law 


Theorem 2. Conditions under which an interleaved sequence is Ordered.

For all sequences $A$, $B$,

$\text{Ordered}(A \text{ ilv } B) \iff A \le^*_0 B \land B \le^*_1 A$.
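Theorem 2 can be spot-checked mechanically. The following Python sketch encodes pairwise-ordering with offset k using 0-based indices and tests the equivalence exhaustively over all pairs of short sequences drawn from a small value range. The helper names are ours, and an exhaustive check over small cases is only a sanity test, not a proof.

```python
from itertools import product

def ordered(s):
    """Ordered(S): every element is <= its successor."""
    return all(s[i] <= s[i + 1] for i in range(len(s) - 1))

def ilv(a, b):
    """Interleave two equal-length sequences: [a1, b1, a2, b2, ...]."""
    return [x for pair in zip(a, b) for x in pair]

def pairwise_ordered(a, b, k):
    """A <=*_k B in 0-based form: a[i] <= b[i + k] wherever i + k is a valid index."""
    return all(a[i] <= b[i + k] for i in range(len(a) - k))

def theorem_2_holds(a, b):
    """Ordered(A ilv B) iff A <=*_0 B and B <=*_1 A."""
    lhs = ordered(ilv(a, b))
    rhs = pairwise_ordered(a, b, 0) and pairwise_ordered(b, a, 1)
    return lhs == rhs

def check_theorem_2(max_len=3, values=range(3)):
    """Try every pair of equal-length sequences up to max_len over the given values."""
    return all(
        theorem_2_holds(a, b)
        for n in range(max_len + 1)
        for a in product(values, repeat=n)
        for b in product(values, repeat=n)
    )
```

Calling `check_theorem_2()` examines every pair of sequences of length at most 3 over the values 0, 1, 2.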

Note  that  this  definition  provides  a  proper  generalization  of  the  notion  of  orderedness: 

Theorem 3. Ordered as a diagonal specialization of $\le^*$.

For all sequences $S$,

$\text{Ordered}(S) \iff S \le^*_1 S$.


Other  laws  are  easily  derived: 

Theorem 4. Transitivity of $\le^*$.

For all sequences $A$, $B$, $C$ of equal length and integers $i$ and $j$,
$A \le^*_i B \land B \le^*_j C \implies A \le^*_{i+j} C$.

As  a  simple  consequence  we  have 


Corollary 1. Only Ordered sequences interleave to form Ordered sequences.
For all sequences $A$, $B$,

$\text{Ordered}(A \text{ ilv } B) \implies \text{Ordered}(A) \land \text{Ordered}(B)$.

Proof:


$\text{Ordered}(A \text{ ilv } B)$

$\iff$ by Theorem 2

$A \le^*_0 B \land B \le^*_1 A$

$\implies$ applying Theorem 4 twice

$A \le^*_1 A \land B \le^*_1 B$

$\iff$ by Theorem 3

$\text{Ordered}(A) \land \text{Ordered}(B)$. □


Theorem 5. Monotonicity of $\le^*$ with respect to merging.

For all sequences $A_1$, $A_2$, $B_1$, and $B_2$ and integers $i$,

$A_1 \le^*_i A_2 \land B_1 \le^*_i B_2 \implies \text{Merge}(A_1, B_1) \le^*_{2i} \text{Merge}(A_2, B_2)$.

We can apply the basic sort operation $\text{sort2}(x, y) = (\min(x, y), \max(x, y))$ over parallel sequences,
just as we did with the comparator $\le$.


Definition 2. Pairwise-sort of sequences with offset $k$.

Define $\text{sort2}^*_k(A, B) = (A', B')$ such that

(1) for $i \le k$, $B'_i = B_i$

(2) for $i = 1, \ldots, n-k$, $(A'_i, B'_{i+k}) = \text{sort2}(A_i, B_{i+k})$

(3) for $i > n-k$, $A'_i = A_i$


For example, $\text{sort2}^*_1([2,3,8,9], [0,1,4,5]) = ([1,3,5,9], [0,2,4,8])$. Laws for $\text{sort2}^*_k$ can be
developed:
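Definition 2 translates almost directly into code. The Python sketch below is one possible rendering (the name `sort2_offset` and the 0-based indexing are our choices); every compare-exchange reads only the original inputs, so the n - k exchanges are independent and could run in parallel.

```python
def sort2(x, y):
    """Basic two-element sort: (min, max)."""
    return (min(x, y), max(x, y))

def sort2_offset(a, b, k):
    """Pairwise-sort with offset k: compare-exchange a[i] with b[i + k].

    Returns (a', b'). The first k elements of b and the last k elements
    of a are copied unchanged, as in clauses (1) and (3) of Definition 2.
    """
    n = len(a)
    assert len(b) == n and 0 <= k <= n
    a_out, b_out = list(a), list(b)
    for i in range(n - k):
        a_out[i], b_out[i + k] = sort2(a[i], b[i + k])
    return a_out, b_out

# The example from the text:
# sort2_offset([2, 3, 8, 9], [0, 1, 4, 5], 1) == ([1, 3, 5, 9], [0, 2, 4, 8])
```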


Theorem 6. $\text{sort2}^*_k$ establishes $\le^*_k$.

For all sequences $A$, $B$, $A'$, and $B'$, and integer $k$,
$\text{sort2}^*_k(A, B) = (A', B') \implies A' \le^*_k B'$.


This theorem is a trivial consequence of the definition of $\text{sort2}^*_k$. The following theorems give
conditions under which important properties of the domain theory ($\le^*$, Ordered) are preserved
under the $\text{sort2}^*_k$ operation. They can be proved using straightforward analysis of cases.


Theorem 7. Ordered is invariant under $\text{sort2}^*_k$.

For all sequences $A$, $B$ and integer $k$,

$\text{Ordered}(A) \land \text{Ordered}(B) \land \text{sort2}^*_k(A, B) = (A', B')$
$\implies \text{Ordered}(A') \land \text{Ordered}(B')$.

Theorem 8. Invariance of $A \le^*_i B$ with respect to $\text{sort2}^*_k(A, B)$.
For all sequences $A$, $B$ and integers $i$ and $k$,
$A \le^*_i B \land \text{sort2}^*_k(A, B) = (A', B') \implies A' \le^*_i B'$.


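Theorem 9. Invariance of $A \le^*_i B$ with respect to $\text{sort2}^*_k(B, A)$.
For all sequences $A$, $B$ and $0 \le i \le k$,


$A \le^*_{k+i} A \land B \le^*_{k+i} B \land A \le^*_i B \land B \le^*_{2k+i} A \land \text{sort2}^*_k(B, A) = (B', A')$
$\implies A' \le^*_i B'$.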
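The laws stating that the pairwise sort establishes pairwise order (Theorem 6), preserves Ordered (Theorem 7), and preserves pairwise order between its own arguments (Theorem 8) lend themselves to a brute-force sanity check. The Python sketch below enumerates all short sequences over a small value range; the helper names are ours, the indexing is 0-based, and passing the check is evidence rather than proof.

```python
from itertools import product

def ordered(s):
    return all(s[i] <= s[i + 1] for i in range(len(s) - 1))

def pairwise_ordered(a, b, k):
    """A <=*_k B in 0-based form."""
    return all(a[i] <= b[i + k] for i in range(len(a) - k))

def sort2_offset(a, b, k):
    """Pairwise-sort with offset k (compare-exchange a[i] with b[i + k])."""
    a_out, b_out = list(a), list(b)
    for i in range(len(a) - k):
        a_out[i], b_out[i + k] = min(a[i], b[i + k]), max(a[i], b[i + k])
    return a_out, b_out

def laws_hold(max_len=3, values=range(3)):
    """Exhaustively test Theorems 6, 7, and 8 on all small inputs."""
    for n in range(max_len + 1):
        for a in product(values, repeat=n):
            for b in product(values, repeat=n):
                for k in range(n + 1):
                    a2, b2 = sort2_offset(a, b, k)
                    # Theorem 6: the result is pairwise-ordered with offset k.
                    if not pairwise_ordered(a2, b2, k):
                        return False
                    # Theorem 7: ordered inputs give ordered outputs.
                    if ordered(a) and ordered(b) and not (ordered(a2) and ordered(b2)):
                        return False
                    # Theorem 8: an existing offset-i ordering survives.
                    for i in range(n + 1):
                        if pairwise_ordered(a, b, i) and not pairwise_ordered(a2, b2, i):
                            return False
    return True
```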


With  these  concepts  and  laws  in  hand,  we  can  proceed  to  derive  Batcher’s  Odd-Even  mergesort.  It 
can  be  derived  simply  by  choosing  to  decompose  the  inputs  to  Merge  by  uninterleaving  them. 


    (A_0, B_0) ------------- Merge -------------> S_0 : seq(integer)
        |                                          ^
        | ilv^{-2}                                 | ?
        v                                          |
    ((A_1,B_1),(A_2,B_2)) -- Merge x Merge -----> (S_1, S_2)


where $\text{ilv}^{-2}$ means $A_0 = A_1 \text{ ilv } A_2$ and $B_0 = B_1 \text{ ilv } B_2$. Note how this decomposition operator
creates subproblems of roughly the same size, which provides good opportunities for parallel computation.
Note also that this decomposition operator must ensure that the subproblems $(A_1, B_1)$
and $(A_2, B_2)$ satisfy the input conditions of Merge. This property is assured by Corollary 1.

We  proceed  by  instantiating  the  Soundness  Axiom  as  before: 


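$A_0 = A_1 \text{ ilv } A_2 \land \text{Ordered}(A_0)$
$\land\ B_0 = B_1 \text{ ilv } B_2 \land \text{Ordered}(B_0)$
$\land\ \text{Seq-to-bag}(S_1) = \text{Seq-to-bag}(A_1) \uplus \text{Seq-to-bag}(B_1) \land \text{Ordered}(S_1)$
$\land\ \text{Seq-to-bag}(S_2) = \text{Seq-to-bag}(A_2) \uplus \text{Seq-to-bag}(B_2) \land \text{Ordered}(S_2)$
$\land\ O_{Compose}(S_0, S_1, S_2)$

$\implies \text{Seq-to-bag}(S_0) = \text{Seq-to-bag}(A_0) \uplus \text{Seq-to-bag}(B_0)$
$\land\ \text{Ordered}(S_0)$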


Constraints on $O_{Compose}$ are derived as follows:


$\text{Seq-to-bag}(S_0) = \text{Seq-to-bag}(A_0) \uplus \text{Seq-to-bag}(B_0) \land \text{Ordered}(S_0)$

$\iff$ by assumption

$\text{Seq-to-bag}(S_0) = \text{Seq-to-bag}(A_1 \text{ ilv } A_2) \uplus \text{Seq-to-bag}(B_1 \text{ ilv } B_2)$
$\land\ \text{Ordered}(S_0)$

$\iff$ distributing Seq-to-bag over ilv

$\text{Seq-to-bag}(S_0) = \text{Seq-to-bag}(A_1) \uplus \text{Seq-to-bag}(A_2) \uplus \text{Seq-to-bag}(B_1) \uplus \text{Seq-to-bag}(B_2)$
$\land\ \text{Ordered}(S_0)$

$\iff$ by assumption

$\text{Seq-to-bag}(S_0) = \text{Seq-to-bag}(S_1) \uplus \text{Seq-to-bag}(S_2)$
$\land\ \text{Ordered}(S_0)$.

The  input  conditions  on  Merge  are  derived  by  forward  inference  from  the  assumptions  above: 


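$A_0 = A_1 \text{ ilv } A_2 \land \text{Ordered}(A_0)$
$\land\ B_0 = B_1 \text{ ilv } B_2 \land \text{Ordered}(B_0)$
$\land\ \text{Ordered}(S_1) \land \text{Ordered}(S_2)$

$\implies$ distributing Ordered over ilv

$A_1 \le^*_0 A_2 \land A_2 \le^*_1 A_1$
$\land\ B_1 \le^*_0 B_2 \land B_2 \le^*_1 B_1$
$\land\ \text{Ordered}(S_1) \land \text{Ordered}(S_2)$

$\implies$ by monotonicity of $\le^*$ with respect to Merge

$S_1 \le^*_0 S_2 \land S_2 \le^*_2 S_1$
$\land\ \text{Ordered}(S_1) \land \text{Ordered}(S_2)$.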


Thus  we  have  derived  the  specification 

Merge-Compose($S_1$ : seq(integer), $S_2$ : seq(integer)
    | $S_1 \le^*_0 S_2 \land S_2 \le^*_2 S_1 \land \text{Ordered}(S_1) \land \text{Ordered}(S_2)$)
returns ($S_0$ : seq(integer)
    | $\text{Seq-to-bag}(S_0) = \text{Seq-to-bag}(S_1) \uplus \text{Seq-to-bag}(S_2)$
      $\land\ \text{Ordered}(S_0)$).

How can this specification be satisfied? Theorems 1.3 and 2 suggest ilv since it would establish
the output conditions of Merge-Compose. Theorem 2 requires that we achieve the input
condition $S_1 \le^*_0 S_2 \land S_2 \le^*_1 S_1$ first. But Theorem 6 ($\text{sort2}^*_k$ establishes $\le^*_k$) enables us to apply
$\text{sort2}^*_1(S_2, S_1)$ in order to achieve the second conjunct. Theorems 7, 8, and 9 ensure that $S_1 \le^*_0 S_2$
remains invariant. So Merge-Compose is satisfied by $\text{ilv} \cdot \text{sort2}^*_1(S_2, S_1)$. The final algorithm in
diagram form is


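    b_0 -------------------- Sort --------------> z_0
     |                                             ^
     | \uplus^{-1}                                 | Merge
     v                                             |
    (b_1, b_2) ------------- Sort x Sort -------> (z_1, z_2)


    (A_0, B_0) ------------- Merge -------------> S_0
        |                                          ^
        | ilv^{-2}                                 | ilv . sort2^*_1(S_2, S_1)
        v                                          |
    ((A_1,B_1),(A_2,B_2)) -- Merge x Merge -----> <S_1, S_2>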
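The two diagrams above can be read as a pair of mutually supporting recursive functions. The following Python sketch is our own sequential rendering, written under several assumptions that the derivation leaves open: input lengths are powers of two, the base case of the merge is a single sort2 on one-element sequences, and Sort splits its input in half. The function names are hypothetical. The code makes no attempt at parallel execution; it only shows the data flow.

```python
def ilv(a, b):
    """Interleave two equal-length sequences: [a1, b1, a2, b2, ...]."""
    return [x for pair in zip(a, b) for x in pair]

def unilv(s):
    """Uninterleave: (odd positions, even positions), counting from 1."""
    return list(s[0::2]), list(s[1::2])

def sort2_offset(a, b, k):
    """Pairwise-sort with offset k (compare-exchange a[i] with b[i + k])."""
    a_out, b_out = list(a), list(b)
    for i in range(len(a) - k):
        a_out[i], b_out[i + k] = min(a[i], b[i + k]), max(a[i], b[i + k])
    return a_out, b_out

def odd_even_merge(a, b):
    """Merge two ordered sequences of equal power-of-two length."""
    assert len(a) == len(b) and len(a) >= 1
    if len(a) == 1:                      # assumed base case: one sort2
        return [min(a[0], b[0]), max(a[0], b[0])]
    a1, a2 = unilv(a)                    # decomposition: uninterleave both inputs
    b1, b2 = unilv(b)
    s1 = odd_even_merge(a1, b1)          # the two submerges are independent
    s2 = odd_even_merge(a2, b2)
    s2, s1 = sort2_offset(s2, s1, 1)     # composition, step 1: offset-1 pairwise sort
    return ilv(s1, s2)                   # composition, step 2: interleave

def odd_even_sort(xs):
    """Mergesort whose merge is the odd-even merge; len(xs) must be a power of two."""
    if len(xs) <= 1:
        return list(xs)
    mid = len(xs) // 2
    return odd_even_merge(odd_even_sort(xs[:mid]), odd_even_sort(xs[mid:]))
```

In a parallel setting the two recursive calls in each function, and all compare-exchanges inside `sort2_offset`, would be evaluated simultaneously.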


To simplify the analysis, assume that the input to Sort has length $n = 2^m$. Given $n$ processors,
Merge runs in time

$T_{Merge}(n) = \max(T_{Merge}(n/2), T_{Merge}(n/2)) + O(1)$
$= O(\log(n))$

since the decomposition and composition operators both can be evaluated in constant time and the
recursion goes to depth $O(\log(n))$.

The decomposition operator $\uplus^{-1}$ in Sort is nondeterministic. This is an advantage at this stage of
design since it allows us to defer commitments and make choices that will maximize performance.
In this case the complexity of Sort is calculated via the recurrence

$T_{Sort}(n) = \max(T_{Sort}(a(n)), T_{Sort}(b(n))) + O(\log(n))$

which is optimized by taking $a(n) = b(n) = n/2$; that is, we split the input bag in half. Given
$n$ processors this algorithm runs in $O(\log^2(n))$ time, so it is suboptimal for sorting. However,
according to [13], Batcher’s Odd-Even sort is the most commonly used of parallel sort algorithms.
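The two recurrences can be evaluated numerically to see the logarithmic and squared-logarithmic growth. In the Python sketch below the constants are assumptions made purely for illustration: each level of Merge is charged one parallel step, a one-element merge costs one step, and Sort on a single element costs nothing.

```python
def merge_steps(n):
    """Parallel steps to merge two ordered sequences of length n (n a power of two)."""
    if n == 1:
        return 1                          # a single sort2
    # The two half-size submerges run side by side; decomposition plus
    # composition is charged as one constant-time step.
    return merge_steps(n // 2) + 1

def sort_steps(n):
    """Parallel steps to sort n elements by halving and then merging."""
    if n == 1:
        return 0
    # Both halves are sorted simultaneously, then merged.
    return sort_steps(n // 2) + merge_steps(n // 2)
```

Under these unit costs, `merge_steps(2**m)` equals m + 1 and `sort_steps(2**m)` equals m(m + 1)/2, in line with the logarithmic and squared-logarithmic bounds stated above.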


7.3.  Related  Sorting  Algorithms 

Several other parallel sorting algorithms can be developed using the techniques above. Batcher’s
bitonic sort [1] and the Periodic Balanced Sort [6] are also basically mergesort algorithms. They differ
from Odd-Even sort in that the merge operation is a divide-and-conquer based on concatenation
as the composition operator. For example, bitonic merge can be diagrammed as follows:

    (A, B) --------------------- Merge ---------------------> S_0
       |
       | [id, reverse]
       v
    (A_0, B_0) ----------------- BMerge --------------------> S_0
       |                                                       ^
       | [halve, halve] . sort2^*_0                            | ++
       v                                                       |
    ((A_1,B_1),(A_2,B_2)) ------ BMerge x BMerge -----------> <S_1, S_2>


The essential fact about using ++ as a composition operator is that $((A_1, B_1), (A_2, B_2))$ must be
a partition in the sense that no element of $A_1$ or $B_1$ is greater than any
element of $A_2$ and $B_2$. The cleverness of the algorithm lies in a special property of sequences
that allows a simple operation ($\text{sort2}^*_0$ here) to effectively produce a partition. This property is
called “bitonicity” for bitonic sort and “balanced” for the periodic balanced sort. (The operation
$(A_0, B_0) = (id(A), reverse(B))$ establishes the bitonic property and decomposition preserves it.)
The challenge in deriving these algorithms lies in discovering these properties given that one wants
a divide-and-conquer algorithm based on ++ as composition. Is there a systematic way to discover
these properties or must we rely on creative invention? Admittedly, there may be other frameworks
within which the discovery of these properties is easier.
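For comparison with the odd-even merge, the bitonic merge diagram can be sketched in Python as follows. This is our reading of the diagram, under the assumptions that input lengths are powers of two and that the pairwise sort with offset 0 is applied before halving, so that the element-wise minima form one subproblem and the maxima the other. The function names are hypothetical.

```python
def sort2_0(a, b):
    """Pairwise-sort with offset 0: element-wise minima and maxima."""
    lo = [min(x, y) for x, y in zip(a, b)]
    hi = [max(x, y) for x, y in zip(a, b)]
    return lo, hi

def halve(s):
    """Split a sequence into its first and second halves."""
    mid = len(s) // 2
    return s[:mid], s[mid:]

def bmerge(a0, b0):
    """Sort a bitonic sequence presented as two halves a0 and b0."""
    lo, hi = sort2_0(a0, b0)             # partition: nothing in lo exceeds anything in hi
    if len(a0) == 1:
        return lo + hi
    # Each part is again bitonic; solve both independently and concatenate.
    return bmerge(*halve(lo)) + bmerge(*halve(hi))

def bitonic_merge(a, b):
    """Merge two ordered sequences of equal power-of-two length."""
    assert len(a) == len(b) and len(a) >= 1
    return bmerge(list(a), list(reversed(b)))   # [id, reverse] makes the input bitonic
```

Here concatenation suffices as the composition step precisely because `sort2_0` has already separated the small elements from the large ones.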

Another  well-known  parallel  sort  algorithm  is  odd-even  transposition  sort.  This  can  be  viewed  as 
a  parallel  variant  of  bubble-sort  which  in  turn  is  derivable  as  a  selection  sort  (local  search  is  used 
to  derive  the  selection  subalgorithm). 

The ilv constructor for sequences has many other applications including polynomial evaluation,
the discrete fast Fourier transform, and matrix transposition. Butterfly and shuffle networks are natural
architectures for implementing algorithms based on ilv [12].


7.4.  Concluding  Remarks 

The Odd-Even sort algorithm is simpler to state than to derive. The properties of an ilv-based
theory of sequences are much harder to understand and develop than a concatenation-based theory.
However, the payoff is an abundance of algorithms with good parallel properties.


8.  Concluding  Remarks 


The applications described above span a range of ways in which synthesis technology can support
parallel software engineering. The Batcher’s sort example illustrates the synthesis of a well-known
high-performance sorting algorithm. The scheduler of real-time concurrent tasks illustrates the
use of synthesis technology to support meta-synthesis: we synthesize a program (scheduler) that
generates a program (the schedule that actually executes on the parallel hardware). The final
example of LDS illustrates how parallel search strategy knowledge can be formally captured and
applied. Furthermore, we have applied it to the parallelization of a meta-synthesizer by parallelizing
the scheduler of real-time parallel processes.